Tuning the MPI Parameters
Principle
Message Passing Interface (MPI) is a standardized, portable message-passing specification designed for parallel computing architectures. The standard covers point-to-point communication, collective operations, process groups, communication contexts, process topologies, environment management and inquiry, and process creation and management.
Hyper MPI is recommended. Based on Open MPI, Hyper MPI integrates the Open UCX point-to-point communication framework and the UCX COLL collective communication optimization framework, and optimizes the algorithm acceleration libraries within them. This gives Huawei's MPI collective communication a competitive edge.
Procedure
- Add the --bind-to core parameter to the end of the mpirun command to bind each MPI process to a specific core.
Recent versions of Open MPI bind processes to CPU cores by default according to a binding policy. However, some builds may not enable binding by default. If the processes of a running MPI program are not bound (you can check by running the grep Cpus_allowed_list /proc/<pid>/status command), add the --bind-to core parameter.
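The binding check above can be sketched as follows; the mpirun line is illustrative (a.out and the process count are placeholders), while the grep line is the actual verification command from the text:

```shell
# Illustrative launch with explicit core binding (a.out is a placeholder MPI program):
#   mpirun --bind-to core -np 4 ./a.out
# While the job runs, inspect the cores a process may run on; a bound process
# shows a narrow list (e.g. a single core), an unbound one shows all cores.
# Use /proc/<pid>/status for a specific rank; /proc/self/status shows the current shell.
grep Cpus_allowed_list /proc/self/status
```

On an unbound shell this typically prints something like `Cpus_allowed_list: 0-15`, covering every core in the system.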
- Adjust the MPI communication parameter UCX_TLS.
UCX_TLS can be used to adjust the communication protocol used by MPI. The values of the parameter include shm, rc, ud, rc_x, ud_x, dc_x and so on. For details about each value, see https://github.com/openucx/ucx/wiki/UCX-environment-parameters.
The default value may not be optimal in some scenarios. You can try combinations such as:
-x UCX_TLS=shm,rc_x,ud_x
-x UCX_TLS=shm,ud_x
-x UCX_TLS=shm,dc_x
-x UCX_TLS=mm,rc_x,ud_x
and so on. Note that shm and mm are mutually exclusive and cannot be used together. The transports rc, ud, rc_x, ud_x, and dc_x can be combined freely in the same command.
If this parameter is not set, the default is equivalent to -x UCX_TLS=all. In small-scale scenarios, you can leave it unset or use -x UCX_TLS=rc_x,shm.
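A full launch line combining the options above might look like the following sketch; the process counts, hostfile name, and program name are placeholders, and the UCX_TLS value is one of the candidate combinations listed earlier:

```shell
# Illustrative run forcing shared-memory transport within a node and
# accelerated RC/UD transports between nodes (values from the list above).
# -x exports the variable to every MPI process on every node.
mpirun -np 64 --hostfile hosts -x UCX_TLS=shm,rc_x,ud_x ./a.out
```

Benchmark each candidate combination with your own workload; which transport set wins depends on the scale of the job and the fabric.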
- Adjust the HCOLL parameters.
The HCOLL parameters control the behavior of collective communication and can be useful on non-standard fat-tree networks. You can run the hcoll_info -a command to view the meaning and valid values of each parameter.
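As a sketch only: HCOLL is enabled through Open MPI's coll_hcoll MCA component, and its tunables are passed as environment variables with -x. The specific variable names below are assumptions that vary with the installed HCOLL version; confirm them against the hcoll_info -a output on your system:

```shell
# Illustrative launch enabling the HCOLL collective component and passing
# an HCOLL tunable (HCOLL_MAIN_IB selects the HCA port; value is a placeholder).
# Verify both names against `hcoll_info -a` before relying on them.
mpirun -np 64 --hostfile hosts \
    -mca coll_hcoll_enable 1 \
    -x HCOLL_MAIN_IB=mlx5_0:1 ./a.out
```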
- Reorder the MPI ranks.
Communication traffic between MPI processes (also referred to as MPI ranks) can be heavy, and the volume differs from one process pair to another. A profiling tool such as IPM can collect the communication traffic between MPI processes. Identify the rank pairs with heavy mutual traffic from their rank IDs, and place those ranks on the same node and under the same TOR switch whenever possible to improve communication performance.
- You can write MPI rankfile files manually to achieve the preceding placement. For details, see the Open MPI documentation. The following is a brief example:
The rankfile myrankfile contains the following:
rank 0=aa slot=2
rank 1=bb slot=3
rank 2=cc slot=1-2
Run the program with rankfile:
mpirun -H aa,bb,cc,dd -rf myrankfile ./a.out
Rank 0 is bound to CPU core 2 on node aa.
Rank 1 is bound to CPU core 3 on node bb.
Rank 2 is bound to CPU cores 1 and 2 on node cc.