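Incorrect Network Type Specified
Symptom
- The mpirun command fails because an incorrect network type is specified when an MPI job is submitted.
An example of the failure is as follows: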
$ mpirun -np 16 -N 2 --hostfile ~/hmpifile_2021/hostfile/hf8 -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=tcp ~/hmpifile_2021/allreduce/AllReduce
[1632384264.377328] [arm-node132:2637768:0] ucp_context.c:1073 UCX ERROR no usable transports/devices (asked tcp on network:mlx5_0:1 )
[arm-node132:2637768] *** An error occurred in MPI_Allreduce
[arm-node132:2637768] *** reported by process [2539913217,0]
[arm-node132:2637768] *** on communicator MPI_COMM_WORLD
[arm-node132:2637768] *** MPI_ERR_INTERN: internal error
[arm-node132:2637768] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[arm-node132:2637768] *** and potentially your MPI job)
[arm-node132:2637714] 7 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[arm-node132:2637714] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
- The transport mode specified when the MPI job is submitted is not supported, and the following message is reported:
UCX WARN transport 'rc_x' is not available, please use one or more of: cma, ib, mm, posix, rc, rc_v, rc_verbs, self, shm, sm, sysv, tcp, ud, ud_v, ud_verbs
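The warning itself lists the transports that are usable on the node. As a minimal sketch, a UCX_TLS value can be compared against that list before the job is submitted. The function name unsupported_tls is hypothetical and is not part of UCX or Hyper MPI. The sketch only compares plain comma-separated transport names.

```shell
# Hypothetical helper: print each transport in $1 (a UCX_TLS value)
# that does not appear in $2 (the list printed in the UCX warning).
unsupported_tls() {
    requested=$1
    available=$2
    # Normalize the available list to one transport name per line.
    avail_lines=$(printf '%s\n' "$available" | tr -d ' ' | tr ',' '\n')
    printf '%s\n' "$requested" | tr -d ' ' | tr ',' '\n' | while IFS= read -r tl; do
        [ -n "$tl" ] || continue
        # -x matches the whole line, -F treats the name as a literal string.
        if ! printf '%s\n' "$avail_lines" | grep -qxF -- "$tl"; then
            printf '%s\n' "$tl"
        fi
    done
}
```

Called with rc_x and the list from the warning above, the function prints rc_x. Empty output means every requested transport appears in the list.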
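Possible Causes
An invalid -x UCX_TLS parameter is specified in the mpirun command. The supported transport modes differ from one network device to another, so the UCX_TLS value must be a transport that the device selected by UCX_NET_DEVICES supports.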
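To see which transports each device supports, the ucx_info -d utility shipped with UCX can be queried on a compute node. The sketch below assumes that the output contains "Transport:" and "Device:" lines in that order, which may vary with the UCX version. The function name list_device_transports is hypothetical.

```shell
# Hypothetical helper: read `ucx_info -d` style output on stdin and
# print one "device transport" pair per line.
list_device_transports() {
    awk '
        $2 == "Transport:" { transport = $3 }       # remember the current transport
        $2 == "Device:"    { print $3, transport }  # emit a pair for each device line
    '
}

# Usage on a node where UCX is installed (not executed here):
#   ucx_info -d | list_device_transports | sort -u
```

If a device such as mlx5_0:1 is not paired with tcp in this listing, requesting UCX_TLS=tcp on that device would be expected to fail as in the first example.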
Recovery Procedure