Specified NIC Name Error
Symptom
The NIC name specified when the MPI job is submitted is incorrect, so the mpirun command fails.
The following is an example of the failure:
$ mpirun -np 8 -N 1 --hostfile ~/hmpifile_2021/hostfile/hf8 -x UCX_NET_DEVICES=mlx5_0:2 ~/hmpifile_2021/allreduce/AllReduce
[1632383945.549496] [arm-node132:2635376:0] ucp_context.c:732 UCX WARN network device 'mlx5_0:2' is not available, please use one or more of: 'enp189s0f0'(tcp), 'enp1s0'(tcp), 'mlx5_0:1'(ib)
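The warning itself lists the device names that UCX accepts on the node. As a minimal sketch, assuming the warning keeps the 'name'(transport) layout shown in this example, the entries can be pulled out with grep. The function name list_suggested_devices is hypothetical and is not part of Hyper MPI or UCX.

```shell
# Hypothetical helper: print each 'name'(transport) entry found in a
# UCX "is not available" warning read from standard input.
list_suggested_devices() {
    grep -o "'[^']*'([a-z]*)"
}

# Sample warning text copied from the failure output in this section.
warning="UCX WARN network device 'mlx5_0:2' is not available, please use one or more of: 'enp189s0f0'(tcp), 'enp1s0'(tcp), 'mlx5_0:1'(ib)"
printf '%s\n' "$warning" | list_suggested_devices
```

For the sample warning, this prints the three suggested entries, one per line. The rejected name 'mlx5_0:2' is skipped because it is not followed by a transport in parentheses.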
Possible Causes
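The NIC name passed to mpirun through UCX_NET_DEVICES does not match any device available on the job execution node. In the example above, mlx5_0:2 is requested, but only mlx5_0:1 is available on that device.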
Procedure
- Use PuTTY to log in to a job execution node as a common Hyper MPI user, for example, hmpi_user.
- Run the following command to query the names of all available NICs on the current job execution node:
ibdev2netdev
mlx5_0 port 1 ==> enp1s0 (Up)
- Rerun the job with UCX_NET_DEVICES set to an available NIC name, for example, mlx5_0:1:
mpirun -np 16 -N 2 --hostfile ~/hmpifile_2021/hostfile/hf8 -x UCX_NET_DEVICES=mlx5_0:1 ~/hmpifile_2021/allreduce/AllReduce
- ~/hmpifile_2021/hostfile is the directory that contains the hostfile. Replace it as required.
- hf8 is the hostfile that lists the job execution nodes. Change the file name as required.
- ~/hmpifile_2021/allreduce is the directory that contains the job to run.
- AllReduce is the job to run. Change it as required.
- mlx5_0:1 is the NIC name obtained from the ibdev2netdev query, written as device:port.
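The mapping from the ibdev2netdev output to a value for UCX_NET_DEVICES can be sketched as follows. This is an illustration only: it assumes every output line has the "device port N ==> interface (state)" layout shown in the procedure, and the function name to_ucx_devices is hypothetical.

```shell
# Hypothetical helper: convert ibdev2netdev-style lines read from standard
# input into device:port names, keeping only ports reported as Up.
to_ucx_devices() {
    awk '$2 == "port" && $NF == "(Up)" { print $1 ":" $3 }'
}

# Sample line copied from the query output in the procedure. On a real
# node, pipe the output of ibdev2netdev into the function instead.
printf '%s\n' 'mlx5_0 port 1 ==> enp1s0 (Up)' | to_ucx_devices
```

For the sample line, this prints mlx5_0:1, which is the value passed to -x UCX_NET_DEVICES in the corrected mpirun command.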