Hyper MPI achieves better performance when MPI_Allreduce uses algorithm 6, MPI_Barrier uses algorithm 5, and MPI_Bcast uses algorithm 3. Example mpirun invocations:
mpirun -np 384 -N 48 --hostfile hf --bind-to core --map-by socket --rank-by core --mca btl ^vader,tcp,openib -x UCX_TLS=sm,ud_x -x UCX_NET_DEVICES=mlx5_0:1 -x UCG_PLANC_UCX_ALLREDUCE_ATTR=I:6S:200R:0- -x UCG_PLANC_UCX_BARRIER_ATTR=I:5S:200R:0 -x UCG_PLANC_UCX_BCAST_ATTR=I:3S:200R:0- -x UCG_PLANC_UCX_ALLREDUCE_FANOUT_INTRA_DEGREE=3 -x UCG_PLANC_UCX_ALLREDUCE_FANIN_INTRA_DEGREE=2 -x UCG_PLANC_UCX_ALLREDUCE_FANOUT_INTER_DEGREE=7 -x UCG_PLANC_UCX_ALLREDUCE_FANIN_INTER_DEGREE=7 -x UCG_PLANC_UCX_BARRIER_FANOUT_INTRA_DEGREE=3 -x UCG_PLANC_UCX_BARRIER_FANIN_INTRA_DEGREE=2 -x UCG_PLANC_UCX_BARRIER_FANOUT_INTER_DEGREE=7 -x UCG_PLANC_UCX_BARRIER_FANIN_INTER_DEGREE=7 test_case
mpirun -np 384 -N 48 --hostfile hf --bind-to core --map-by socket --rank-by core --mca btl ^vader,tcp,openib -x UCX_TLS=sm,ud -x UCX_NET_DEVICES=mlx5_1:1 -x UCG_PLANC_UCX_ALLREDUCE_ATTR=I:6S:200R:0- -x UCG_PLANC_UCX_BARRIER_ATTR=I:5S:200R:0 -x UCG_PLANC_UCX_BCAST_ATTR=I:3S:200R:0- -x UCG_PLANC_UCX_ALLREDUCE_FANOUT_INTRA_DEGREE=3 -x UCG_PLANC_UCX_ALLREDUCE_FANIN_INTRA_DEGREE=2 -x UCG_PLANC_UCX_ALLREDUCE_FANOUT_INTER_DEGREE=7 -x UCG_PLANC_UCX_ALLREDUCE_FANIN_INTER_DEGREE=7 -x UCG_PLANC_UCX_BARRIER_FANOUT_INTRA_DEGREE=3 -x UCG_PLANC_UCX_BARRIER_FANIN_INTRA_DEGREE=2 -x UCG_PLANC_UCX_BARRIER_FANOUT_INTER_DEGREE=7 -x UCG_PLANC_UCX_BARRIER_FANIN_INTER_DEGREE=7 test_case
mpirun -np 384 -N 48 --hostfile hf --bind-to core --map-by socket --rank-by core --mca btl ^vader,tcp,openib -x UCX_TLS=sm,ud -x UCG_PLANC_STARS_TLS=rc_acc -x UCG_PLANC_STARS_RC_ROCE_LOCAL_SUBNET=y -x UCG_PLANC=ucx,stars -x UCX_UD_VERBS_ROCE_LOCAL_SUBNET=y -x UCG_PLANC_STARS_IBCAST_ATTR=I:1S:200R:0 -x UCG_PLANC_STARS_IALLGATHERV_ATTR=I:1S:200R:0 -x UCG_PLANC_STARS_ISCATTERV_ATTR=I:1S:200R:0 -x UCG_PLANC_STARS_IALLTOALLV_ATTR=I:1S:200R:0 -x UCG_PLANC_STARS_IBARRIER_ATTR=I:1S:200R:0 test_case
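The commands above launch 384 processes with 48 per node (-np 384 -N 48), i.e. 8 nodes. A minimal sketch of the hostfile hf they reference is shown below; the hostnames are hypothetical placeholders, not part of the original commands.

# Hypothetical 8-node hostfile for -np 384 -N 48 (48 slots per node)
node01 slots=48
node02 slots=48
node03 slots=48
node04 slots=48
node05 slots=48
node06 slots=48
node07 slots=48
node08 slots=48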
Both the IB and RoCE network environments use the Kunpeng server architecture and Mellanox NICs.
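test_case stands for the application or benchmark binary being launched; its source is not shown in the original. The following is a minimal sketch in C, assuming a simple benchmark that exercises the MPI_Allreduce, MPI_Bcast, and MPI_Barrier collectives tuned by the settings above. The buffer size and iteration count are illustrative assumptions.

/*
 * Minimal sketch of a possible "test_case": times the collectives
 * covered by the UCG_PLANC_UCX_*_ATTR settings above.
 * Buffer size and iteration count are illustrative assumptions.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int count = 1024;       /* elements per collective call */
    const int iterations = 100;   /* repeat to obtain a stable timing */
    double *sendbuf = malloc(count * sizeof(double));
    double *recvbuf = malloc(count * sizeof(double));
    for (int i = 0; i < count; i++)
        sendbuf[i] = (double)rank + i;

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();

    for (int iter = 0; iter < iterations; iter++) {
        /* Blocking collectives tuned above: Allreduce, Bcast, Barrier */
        MPI_Allreduce(sendbuf, recvbuf, count, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);
        MPI_Bcast(recvbuf, count, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        MPI_Barrier(MPI_COMM_WORLD);
    }

    double elapsed = MPI_Wtime() - start;
    if (rank == 0)
        printf("ranks=%d avg time per iteration: %.6f s\n",
               size, elapsed / iterations);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}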