Python and C Programs Stop Abnormally

Fault Locating

A hybrid Python and C/C++ program does not run as designed and stops abnormally. To locate the fault:

  1. Write a test script that reproduces the fault.
  2. Print the job failure logs.
  3. Locate the source code based on the logs and analyze the cause of the error.
  4. Modify the code, then build and run it again to verify the modification.
  5. If the fault is rectified, integrate the modification into the code.

Case: The Main Process Stops Abnormally Because a New Thread Cannot Be Started

Symptom

In a cluster environment, a customer reports that more than 40% of jobs fail when a large number of Python analysis jobs are submitted through Huawei Donau Scheduler.

Fault Locating

  1. Write a test script to reproduce the fault.

    Write and run the test script mol_map_label_dsub_vs1_chain_pa_hzl.sh.

    cd exp/cryonet2/
    exp/label_data/mol_map_label_dsub_vs1_chain_pa_hzl.sh ~/data/chains8_testhzl/1.list >ID_1000.list
    

    Of the 1000 jobs, 600 are executed successfully, a success rate of 60%. Run the following command to write the execution results to the djob-s-all-1000.log file:

    djob -st "2021/12/11 10:44:52" -et "2021/12/11 10:55:00" -s all -n default > djob-s-all-1000.log
    

    Count the jobs that succeeded.

    cat djob-s-all-1000.log | grep SUCCEEDED | wc -l
    

    Count the submitted jobs.

    cat ID_1000.list | grep submit | wc -l
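
The two counts above can be combined into a success rate. The following is a minimal Python sketch of that calculation, assuming only what the grep commands assume: a successful job produces a line containing SUCCEEDED, and a submission produces a line containing submit. The function names are hypothetical and are not part of the Donau tooling.

```python
def count_matching_lines(lines, keyword):
    """Count lines that contain keyword, like `grep keyword | wc -l`."""
    return sum(1 for line in lines if keyword in line)


def success_rate(status_lines, submit_lines):
    """Return (succeeded, submitted, rate) for the two line sources.

    status_lines: lines of the job status log (for example djob-s-all-1000.log)
    submit_lines: lines of the submission output (for example ID_1000.list)
    """
    succeeded = count_matching_lines(status_lines, "SUCCEEDED")
    submitted = count_matching_lines(submit_lines, "submit")
    # Guard against an empty submission list to avoid dividing by zero.
    rate = succeeded / submitted if submitted else 0.0
    return succeeded, submitted, rate
```

Open file objects can be passed directly because they iterate line by line, for example success_rate(open("djob-s-all-1000.log"), open("ID_1000.list")).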
    

  2. Run the script and view the failure logs.

  3. The failure logs show that OpenBLAS cannot create new threads. Run the following command to count the threads in use. The output shows that 10816 processes or threads are in use.

    ps -ef -T | grep sync360 | grep mid | wc -l
    

    OpenBLAS is a highly optimized linear algebra library. It accelerates computing by starting multiple threads. By default, the number of threads it starts equals the number of physical cores on the current node.

    The current compute node has 104 cores (2 CPUs, 52 cores for each CPU). Therefore, 104 threads are started for a single task.

    However, the dsub job submission command specifies -R cpu=1, so each job applies for only one compute core. As a result, 104 jobs are allocated to the same compute node. Each job still starts 104 OpenBLAS threads, so the node runs 104 x 104 = 10816 threads in total, which exceeds the Linux limit on the number of processes and threads per user.

    The default resource limits are as follows:

    ulimit -a
    core file size          (blocks, -c) 0
    data seg size           (kbytes, -d) unlimited
    scheduling priority             (-e) 0
    file size               (blocks, -f) unlimited
    pending signals                 (-i) 511844
    max locked memory       (kbytes, -l) 64
    max memory size         (kbytes, -m) unlimited
    open files                      (-n) 1024
    pipe size            (512 bytes, -p) 8
    POSIX message queues     (bytes, -q) 819200
    real-time priority              (-r) 0
    stack size              (kbytes, -s) 8192
    cpu time               (seconds, -t) unlimited
    max user processes              (-u) 4096
    virtual memory          (kbytes, -v) unlimited
    file locks                      (-x) unlimited

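    The preceding output is the default configuration for ordinary Linux users. The max user processes limit (-u) is 4096. Linux counts every thread against this limit, so a user can run at most 4096 processes and threads in total.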
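
The arithmetic in this step can be checked with a short script. The following is a minimal Python sketch, not part of the original procedure: the function names are hypothetical, the figures are the ones from this case, and the limit is read with the standard resource module (available on Unix-like systems only).

```python
import resource


def total_threads(jobs_per_node, threads_per_job):
    """Threads created on one node when every job starts its own pool."""
    return jobs_per_node * threads_per_job


def exceeds_limit(jobs_per_node, threads_per_job, nproc_limit):
    """True if the jobs alone need more threads than the limit allows.

    nproc_limit is None when the limit is unlimited. Other processes of
    the same user also count against the limit, so this is a lower bound.
    """
    if nproc_limit is None:
        return False
    return total_threads(jobs_per_node, threads_per_job) > nproc_limit


def current_nproc_limit():
    """Soft RLIMIT_NPROC of this process, the value `ulimit -u` reports."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return None if soft == resource.RLIM_INFINITY else soft


# Figures from this case: 104 jobs per node, limit of 4096.
print(total_threads(104, 104))        # 10816
print(exceeds_limit(104, 104, 4096))  # True: 104 threads per job
print(exceeds_limit(104, 1, 4096))    # False: 1 thread per job
```

On the compute node itself, current_nproc_limit() can replace the hard-coded 4096.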

  4. Set the OMP_NUM_THREADS=1 environment variable to limit the number of OpenBLAS threads for each job so that it matches the number of CPU cores the job applies for. After the modification, none of the 40,000+ batch jobs fails.
     OMP_NUM_THREADS=1 python tools.py -f label_mol_map_atom1437 -p 2c7c_B -o data/chains7/2c7c_B.npz --res 4 --voxel_size 1.0
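
The same limit can also be applied from inside the Python program, so that the job does not depend on how it is launched. The following is a minimal sketch rather than the fix used in this case: pin_blas_threads is a hypothetical helper, and the sketch assumes that the script loads OpenBLAS through a library such as NumPy. OPENBLAS_NUM_THREADS is the OpenBLAS-specific variable, and OMP_NUM_THREADS is the variable used in the command above.

```python
import os

# Thread-count variables that OpenBLAS reads when it is loaded.
THREAD_ENV_VARS = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS")


def pin_blas_threads(num_threads=1, env=os.environ):
    """Set the thread-count variables unless they are already set.

    num_threads should match the number of cores the job applies for
    (1 when the job is submitted with -R cpu=1).
    Returns the effective values for logging.
    """
    for name in THREAD_ENV_VARS:
        # setdefault keeps a value passed on the command line.
        env.setdefault(name, str(num_threads))
    return {name: env[name] for name in THREAD_ENV_VARS}


# Call this before importing any library that loads OpenBLAS, because
# the variables are read when the library is initialized.
pin_blas_threads(1)
# import numpy as np  # assumption: numerical imports go after pinning
```

Setting the variables after the numerical library has been imported generally has no effect on the thread pool, which is why the call comes first.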