Python and C Programs Stop Abnormally
Fault Locating
A program that mixes Python with compiled C/C++ code does not run as designed and stops abnormally.
- Write a test script that can reproduce the fault.
- Print job failure logs.
- Locate the source code based on the logs and analyze the error cause.
- Modify the code. Compile the code again to verify the modification.
- If the fault is rectified, integrate the modification into the code.
Case: The Main Process Stops Abnormally Because a New Thread Cannot Be Started
Symptom
In a cluster environment, a customer reports that more than 40% of Python analysis jobs fail when Huawei Donau Scheduler is used to submit a large number of Python analysis jobs.
Fault Locating
- Write a test script to reproduce the fault.
Run the test script mol_map_label_dsub_vs1_chain_pa_hzl.sh and redirect its output to ID_1000.list.
cd exp/cryonet2/
exp/label_data/mol_map_label_dsub_vs1_chain_pa_hzl.sh ~/data/chains8_testhzl/1.list >ID_1000.list
Of the 1000 jobs submitted, 600 succeed, a success rate of 60%. Run the following command to write the execution results to the djob-s-all-1000.log file:
djob -st "2021/12/11 10:44:52" -et "2021/12/11 10:55:00" -s all -n default > djob-s-all-1000.log
Count the jobs that succeeded.
cat djob-s-all-1000.log | grep SUCCEEDED | wc -l
Count the jobs that were submitted.
cat ID_1000.list | grep submit | wc -l
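The two counts above can be combined into a success rate. The following is a minimal Python sketch of that calculation. The helper names and the sample lines are hypothetical and do not reflect the real djob or dsub output format. Only the SUCCEEDED and submit keywords come from the commands above.

```python
def count_matching(lines, keyword):
    """Count lines that contain the keyword (same idea as grep | wc -l)."""
    return sum(1 for line in lines if keyword in line)


def success_rate(log_lines, id_lines):
    """Fraction of submitted jobs that reached the SUCCEEDED state."""
    submitted = count_matching(id_lines, "submit")
    succeeded = count_matching(log_lines, "SUCCEEDED")
    return succeeded / submitted if submitted else 0.0


# Made-up sample lines for illustration only. For real data, pass
# open("djob-s-all-1000.log") and open("ID_1000.list") instead.
sample_log = ["job-a SUCCEEDED", "job-b FAILED", "job-c SUCCEEDED", "job-d FAILED"]
sample_ids = ["submit job-a", "submit job-b", "submit job-c", "submit job-d"]

print(f"success rate: {success_rate(sample_log, sample_ids):.0%}")
```

Because the helpers accept any iterable of lines, open file objects can be passed directly without loading the whole log into memory.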

- Run the script and view the failure logs.

- The logs show that OpenBLAS cannot create new threads. Run the following command to count the threads in use. The output shows that 10816 processes or threads are used.
ps -ef -T | grep sync360 | grep mid | wc -l
OpenBLAS is a highly optimized linear algebra library. It accelerates computing by starting multiple threads. By default, it starts as many threads as the node has CPU cores.
The current compute node has 104 cores (2 CPUs, 52 cores for each CPU). Therefore, 104 threads are started for a single job.
However, -R cpu=1 is specified in the dsub job submission command, indicating that each job applies for only one compute core. As a result, 104 jobs are allocated to the same compute node, and OpenBLAS creates 104 x 104 = 10816 threads, which exceeds the Linux per-user limit on processes and threads.
The default values are as follows:
ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 511844
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
The preceding output shows the default limits for an ordinary Linux user. The max user processes limit (-u), which on Linux also counts threads, defaults to 4096.
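The arithmetic behind this step can be checked in a few lines. The following is a minimal Python sketch, assuming a Linux node. The function names are hypothetical. The 104-core figure comes from the case above, and the limit is read from the current shell through the standard resource module rather than hard-coded.

```python
import resource


def projected_threads(node_cores, cores_per_job, threads_per_job):
    """Threads created on one node when it is filled with identical jobs."""
    jobs_per_node = node_cores // cores_per_job
    return jobs_per_node * threads_per_job


def soft_nproc_limit():
    """Soft 'max user processes' limit of this user, or None if unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return None if soft == resource.RLIM_INFINITY else soft


NODE_CORES = 104  # 2 CPUs x 52 cores, as in the case above

limit = soft_nproc_limit()
scenarios = {
    "default (one thread per core)": projected_threads(NODE_CORES, 1, NODE_CORES),
    "one thread per job": projected_threads(NODE_CORES, 1, 1),
}
for label, threads in scenarios.items():
    over = limit is not None and threads > limit
    print(f"{label}: {threads} threads, limit {limit}, exceeds limit: {over}")
```

With the default limit of 4096 shown above, the first scenario (10816 threads) exceeds the limit and the second (104 threads) does not. The printed result depends on the limit of the shell in which the sketch runs.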
- Set the OMP_NUM_THREADS=1 environment variable to limit the number of OpenBLAS threads for each job so that it matches the number of CPU cores applied for. After the modification, none of the more than 40,000 batch jobs fails.
OMP_NUM_THREADS=1 python tools.py -f label_mol_map_atom1437 -p 2c7c_B -o data/chains7/2c7c_B.npz --res 4 --voxel_size 1.0
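If changing the submission command is inconvenient, the same limit can be applied from inside the Python entry point. The following is a minimal sketch, not the fix used in this case. The helper name limit_blas_threads is hypothetical, and OPENBLAS_NUM_THREADS is an additional OpenBLAS environment variable that the case above does not use. The variables need to be set before NumPy or any other OpenBLAS-backed library is imported, because the thread pool is sized when the library loads.

```python
import os


def limit_blas_threads(n_threads=1):
    """Set thread-count variables; call before importing NumPy or SciPy."""
    settings = {
        "OMP_NUM_THREADS": str(n_threads),       # variable used in the case above
        "OPENBLAS_NUM_THREADS": str(n_threads),  # OpenBLAS-specific, an assumption here
    }
    os.environ.update(settings)
    return settings


applied = limit_blas_threads(1)
print(applied)

# Import OpenBLAS-backed libraries only after the variables are set, for example:
# import numpy as np
```

Setting the variable on the command line, as in the case above, has the advantage that it takes effect regardless of import order in the script.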