Specifying Algorithms

Allreduce Algorithms

  • When using algorithm 13 or 14, ensure that processes are evenly distributed across sockets by adding the following options:

    --map-by socket --rank-by core

  • When algorithm 5 or 6 is selected, add the parameters that set the K value of the K-nomial tree, because these algorithms use a K-nomial tree within a node. The following commands are examples:

    -x UCG_PLANC_UCX_ALLREDUCE_FANOUT_INTRA_DEGREE=3

    -x UCG_PLANC_UCX_ALLREDUCE_FANIN_INTRA_DEGREE=8

  • When algorithm 7 or 8 is used, add the parameters that set the K values of the K-nomial trees, because these algorithms use a K-nomial tree both within a node and between nodes. The following commands are examples:

    -x UCG_PLANC_UCX_ALLREDUCE_FANOUT_INTER_DEGREE=7 -x UCG_PLANC_UCX_ALLREDUCE_FANIN_INTER_DEGREE=7

    -x UCG_PLANC_UCX_ALLREDUCE_FANOUT_INTRA_DEGREE=3 -x UCG_PLANC_UCX_ALLREDUCE_FANIN_INTRA_DEGREE=8
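The degree parameters above set K, the maximum number of children a parent takes at each level of a K-nomial tree. The following Python sketch builds a generic K-nomial tree over relative ranks 0 to p-1 rooted at rank 0 and counts the tree depth, to show why a larger K gives a flatter, wider tree. It illustrates the textbook structure under these assumptions only; it is not UCG's internal implementation, whose rank ordering and root handling may differ.

```python
"""Generic K-nomial tree over relative ranks 0..size-1, rooted at rank 0.

Illustrative sketch only (assumption: textbook layout); UCG may order ranks differently.
"""
from typing import List, Optional


def knomial_parent(rank: int, k: int) -> Optional[int]:
    """Parent of `rank`: clear the lowest nonzero base-k digit of the rank."""
    if rank == 0:
        return None  # rank 0 is the root
    mask = 1
    while (rank // mask) % k == 0:
        mask *= k
    return rank - ((rank // mask) % k) * mask


def knomial_children(rank: int, k: int, size: int) -> List[int]:
    """Children of `rank`: up to k-1 per digit position below its lowest nonzero digit."""
    children = []
    mask = 1
    while mask < size:
        if rank != 0 and (rank // mask) % k != 0:
            break  # reached the digit that links `rank` to its own parent
        for j in range(1, k):
            child = rank + j * mask
            if child < size:
                children.append(child)
        mask *= k
    return children


def knomial_rounds(size: int, k: int) -> int:
    """Tree depth, ceil(log_k(size)): each level multiplies the reached ranks by k."""
    rounds, reached = 0, 1
    while reached < size:
        reached *= k
        rounds += 1
    return rounds


if __name__ == "__main__":
    size = 16
    for k in (2, 3, 8):
        print(f"K={k}: root children={knomial_children(0, k, size)}, "
              f"depth={knomial_rounds(size, k)}")
```

For 16 ranks, K=2 gives the root 4 children and a depth of 4, while K=8 gives it 8 children and a depth of 2. Fewer levels mean fewer sequential steps, but each parent handles more children per level, which is the trade-off the degree parameters tune.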

To improve performance, you can also add the following option to the command:

-x UCX_TLS=sm,rc_x

The following is an example of the command used when MPI_Allreduce and MPI_Iallreduce are called (Kunpeng processor):

mpirun -np 16 -N 2 --hostfile hf8 --mca btl ^vader,tcp,openib -x UCX_TLS=sm,rc_x -x UCG_PLANC_UCX_ALLREDUCE_ATTR=I:nS:200R:0- test_case
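When scripting runs, it can help to assemble the command as an argument list rather than by hand. The Python sketch below reproduces the example above and appends tuning parameters as `-x NAME=VALUE` pairs. It assumes that the `n` in `I:nS:200R:0-` is a placeholder for the algorithm number; that reading, the function name `build_mpirun`, and its `sockets_per_node` argument are illustrative assumptions, not part of the documented interface. Pass `sockets_per_node` (check your hardware, for example with `lscpu`) when using algorithm 13 or 14 to add the socket mapping options and reject uneven layouts.

```python
import shlex
from typing import Dict, List, Optional


def build_mpirun(coll: str, algo: int, np_total: int, per_node: int,
                 hostfile: str, binary: str,
                 extra_env: Optional[Dict[str, object]] = None,
                 sockets_per_node: Optional[int] = None) -> List[str]:
    """Assemble an mpirun argument list selecting UCG algorithm `algo` for `coll`.

    ASSUMPTION: `algo` replaces the `n` placeholder in `I:nS:200R:0-`.
    """
    if np_total % per_node:
        raise ValueError("-np must be a multiple of the processes per node (-N)")
    cmd = ["mpirun", "-np", str(np_total), "-N", str(per_node),
           "--hostfile", hostfile]
    if sockets_per_node is not None:
        # Algorithms 13 and 14 expect the same number of processes on every socket.
        if per_node % sockets_per_node:
            raise ValueError("processes per node do not split evenly across sockets")
        cmd += ["--map-by", "socket", "--rank-by", "core"]
    cmd += ["--mca", "btl", "^vader,tcp,openib",
            "-x", "UCX_TLS=sm,rc_x",
            "-x", f"UCG_PLANC_UCX_{coll.upper()}_ATTR=I:{algo}S:200R:0-"]
    for name, value in (extra_env or {}).items():
        cmd += ["-x", f"{name}={value}"]  # no spaces around '=' in -x NAME=VALUE
    cmd.append(binary)
    return cmd


if __name__ == "__main__":
    # Allreduce algorithm 7 with the inter-/intra-node K values from the examples above.
    args = build_mpirun(
        "allreduce", 7, 16, 2, "hf8", "test_case",
        extra_env={
            "UCG_PLANC_UCX_ALLREDUCE_FANOUT_INTER_DEGREE": 7,
            "UCG_PLANC_UCX_ALLREDUCE_FANIN_INTER_DEGREE": 7,
            "UCG_PLANC_UCX_ALLREDUCE_FANOUT_INTRA_DEGREE": 3,
            "UCG_PLANC_UCX_ALLREDUCE_FANIN_INTRA_DEGREE": 8,
        })
    print(shlex.join(args))  # shell-quoted; pass `args` to subprocess.run() to launch
```

Keeping the command as a list avoids shell-quoting problems with values such as `^vader,tcp,openib`. The same helper works for the other collectives by changing `coll` (for example `"bcast"` or `"scatterv"`).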

Bcast Algorithms

  • When algorithm 3 is selected, add the parameter that sets the K value, because this algorithm uses a K-nomial tree between nodes. An example command is as follows:

    -x UCG_PLANC_UCX_BCAST_NA_KNTREE_INTER_DEGREE=7

  • When algorithm 4 is selected, add the two parameters that set the K values, because this algorithm uses a K-nomial tree both within a node and between nodes. The following commands are examples:

    -x UCG_PLANC_UCX_BCAST_NA_KNTREE_INTER_DEGREE=7

    -x UCG_PLANC_UCX_BCAST_NA_KNTREE_INTRA_DEGREE=3

  • When algorithm 9 is used, you can set the following parameter to adjust the number of blocks used by the ESBT algorithm, and tune it to find the value that gives the best performance. The default value 0 means that the number of blocks is selected automatically. An example command is as follows:

    -x UCG_PLANC_UCX_BCAST_ESBT_BLOCKS=3
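Finding the best value for a tunable such as the ESBT block count usually means running the same benchmark once per candidate value. The Python sketch below generates one command per value from the Bcast example later in this section. It assumes that `n` in `I:nS:200R:0-` is the algorithm number (9 for ESBT here); the candidate range 0 to 8 is an illustrative choice, not a documented limit, and the `sweep` helper is hypothetical. Run each command with your own benchmark and keep the fastest setting.

```python
import shlex
from typing import Iterable, Iterator, List, Tuple

# Base command from the Bcast example, with `n` replaced by 9 (ESBT).
# ASSUMPTION: `n` in I:nS:200R:0- is the algorithm number.
BASE = shlex.split(
    "mpirun -np 16 -N 2 --hostfile hf8 --mca btl ^vader,tcp,openib "
    "-x UCX_TLS=sm,rc_x -x UCG_PLANC_UCX_BCAST_ATTR=I:9S:200R:0- test_case")


def sweep(base: List[str], param: str,
          values: Iterable[int]) -> Iterator[Tuple[int, List[str]]]:
    """Yield (value, command) with `-x param=value` inserted before the binary."""
    for v in values:
        yield v, base[:-1] + ["-x", f"{param}={v}"] + base[-1:]


if __name__ == "__main__":
    # 0 means auto; 1..8 are illustrative candidate block counts.
    for value, cmd in sweep(BASE, "UCG_PLANC_UCX_BCAST_ESBT_BLOCKS", range(0, 9)):
        print(shlex.join(cmd))
```

The same generator applies to the other tunables in this section, such as the K-nomial degrees or the Scatterv send-batch parameters, by changing `param` and the base command.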

To improve performance, you can also add the following option to the command:

-x UCX_TLS=sm,rc_x

The following is an example of the command used when MPI_Bcast and MPI_Ibcast are called (Kunpeng processor):

mpirun -np 16 -N 2 --hostfile hf8 --mca btl ^vader,tcp,openib -x UCX_TLS=sm,rc_x -x UCG_PLANC_UCX_BCAST_ATTR=I:nS:200R:0- test_case

Barrier Algorithms

As listed in Table 3, Barrier algorithms are a subset of Allreduce algorithms. For details about how to specify Barrier algorithms, see the description of specifying Allreduce algorithms.

Scatterv Non-Blocking API Algorithms

  • When algorithm 1 is used, you can adjust the following parameters to find the values that give the best performance. An example command is as follows:

    -x UCG_PLANC_UCX_SCATTERV_MIN_SEND_BATCH=7

    -x UCG_PLANC_UCX_SCATTERV_MAX_SEND_BATCH=8

  • When algorithm 2 (the K-nomial algorithm) is used, you can adjust the K value to find the value that gives the best performance. An example command is as follows:

    -x UCG_PLANC_UCX_SCATTERV_KNTREE_DEGREE=3
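In a K-nomial scatter, each parent forwards to every child the blocks for that child's entire subtree, so K controls both how many children a parent serves and how much data each child relays onward. The Python sketch below computes that forwarding plan for Scatterv-style per-rank counts. It assumes a generic K-nomial layout over relative ranks with the root at 0 and blocks stored contiguously in rank order; it is an illustration of the technique, not UCG's implementation, and the function name and sample counts are hypothetical.

```python
from typing import List, Sequence, Tuple


def knomial_scatterv_plan(rank: int, k: int,
                          sendcounts: Sequence[int]) -> List[Tuple[int, int, int]]:
    """For each child of `rank`, return (child, end, elements).

    The child receives the blocks for ranks child..end-1 (its whole subtree),
    and `elements` is the sum of sendcounts over that range.
    ASSUMPTION: generic K-nomial tree over relative ranks, root = 0.
    """
    size = len(sendcounts)
    plan = []
    mask = 1
    while mask < size:
        if rank != 0 and (rank // mask) % k != 0:
            break  # remaining digit positions belong to rank's own parent
        for j in range(1, k):
            child = rank + j * mask
            if child < size:
                end = min(child + mask, size)  # the subtree spans up to `mask` ranks
                plan.append((child, end, sum(sendcounts[child:end])))
        mask *= k
    return plan


if __name__ == "__main__":
    counts = [4, 1, 2, 3, 5, 1, 2, 2, 6]  # hypothetical per-rank counts
    for child, end, n in knomial_scatterv_plan(0, 3, counts):
        print(f"root -> rank {child}: blocks for ranks {child}..{end - 1}, {n} elements")
```

With K=3 and nine ranks, the root sends single blocks to ranks 1 and 2 but three-block bundles to ranks 3 and 6, which then forward them. A larger K shifts more of the sending work onto the root and shortens the relay chains.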

To improve performance, you can also add the following option to the command:

-x UCX_TLS=sm,rc_x

The following is an example of the command used when MPI_Iscatterv is called (Kunpeng processor):

mpirun -np 16 -N 2 --hostfile hf8 --mca btl ^vader,tcp,openib -x UCX_TLS=sm,rc_x -x UCG_PLANC_UCX_SCATTERV_ATTR=I:nS:200R:0- test_case

Allgatherv Algorithms

To improve performance, you can also add the following option to the command:

-x UCX_TLS=sm,rc_x

The following is an example of the command used when MPI_Allgatherv and MPI_Iallgatherv are called (Kunpeng processor):

mpirun -np 16 -N 2 --hostfile hf8 --mca btl ^vader,tcp,openib -x UCX_TLS=sm,rc_x -x UCG_PLANC_UCX_ALLGATHERV_ATTR=I:nS:200R:0- test_case