The mpirun command fails because an incorrect network device name was specified when the MPI job was submitted.
Example of the failure:
$ mpirun -np 8 -N 1 --hostfile ~/hmpifile_2021/hostfile/hf8 -x UCX_NET_DEVICES=mlx5_0:2 ~/hmpifile_2021/allreduce/AllReduce
[1632383945.549496] [arm-node132:2635376:0] ucp_context.c:732 UCX WARN network device 'mlx5_0:2' is not available, please use one or more of: 'enp189s0f0'(tcp), 'enp1s0'(tcp), 'mlx5_0:1'(ib)
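The network device name passed to mpirun is incorrect: the job requests mlx5_0:2 (port 2 of mlx5_0), but UCX reports that the only available port on that device is mlx5_0:1.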
Run ibdev2netdev to list the available devices and their ports:
$ ibdev2netdev
mlx5_0 port 1 ==> enp1s0 (Up)
Rerun mpirun with an available device and port, here mlx5_0:1:
$ mpirun -np 16 -N 2 --hostfile ~/hmpifile_2021/hostfile/hf8 -x UCX_NET_DEVICES=mlx5_0:1 ~/hmpifile_2021/allreduce/AllReduce
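
The device check can also be scripted so that a wrong name is caught before the job is submitted. The following is a minimal sketch, not part of the original procedure. It assumes that ibdev2netdev prints lines in the format shown above ("<device> port <n> ==> <netdev> (Up)"). The function names list_ucx_devices and check_ucx_device are hypothetical helpers introduced only for illustration.

```shell
# Hypothetical helper: read ibdev2netdev-style lines on stdin and print
# "<device>:<port>" for every port whose link state is (Up).
# Assumes the line format "<device> port <n> ==> <netdev> (Up)".
list_ucx_devices() {
    awk '$2 == "port" && $NF == "(Up)" { print $1 ":" $3 }'
}

# Hypothetical helper: succeed only if the device name given as $1
# (for example mlx5_0:1) is among the Up ports read from stdin.
check_ucx_device() {
    list_ucx_devices | grep -qxF "$1"
}
```

A possible use is to run "ibdev2netdev | check_ucx_device mlx5_0:1" on a compute node and pass the name to -x UCX_NET_DEVICES only if the check succeeds. The check covers only the node where it runs, so every host in the hostfile would need to be checked in the same way.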