Specified NIC Name Error

Symptom

The NIC name specified when the MPI job is submitted is incorrect, so the mpirun command fails.

The following is an example of the execution failure:

$  mpirun  -np 8  -N 1  --hostfile ~/hmpifile_2021/hostfile/hf8 -x UCX_NET_DEVICES=mlx5_0:2 ~/hmpifile_2021/allreduce/AllReduce
[1632383945.549496] [arm-node132:2635376:0]    ucp_context.c:732  UCX  WARN  network device 'mlx5_0:2' is not available, please use one or more of: 'enp189s0f0'(tcp), 'enp1s0'(tcp), 'mlx5_0:1'(ib)

Possible Causes

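The NIC name passed to mpirun through UCX_NET_DEVICES does not exist on the job execution node. In the example above, mlx5_0:2 refers to port 2 of mlx5_0, but the warning lists only mlx5_0:1 as an available InfiniBand device.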
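
The UCX warning already lists the device names that are accepted on the node, so one way to confirm the cause is to extract that list and compare it with the name that was passed. The following is a minimal sketch, assuming the warning keeps the wording shown in the example above; the function name ucx_suggested_devices is hypothetical and is not part of Hyper MPI or UCX.

```shell
# Hypothetical helper (not part of Hyper MPI or UCX): read mpirun output on
# stdin and print each device suggested by the UCX "is not available"
# warning as "<name> <transport>", one per line.
ucx_suggested_devices() {
  grep 'is not available, please use one or more of:' |
    sed 's/.*please use one or more of: //' |   # keep only the suggestion list
    tr ',' '\n' |                               # one suggestion per line
    sed "s/^ *'\([^']*\)'(\([^)]*\)).*/\1 \2/"  # 'name'(transport) -> name transport
}
```

For example, the output of the failing command can be piped into the helper by appending 2>&1 | ucx_suggested_devices to the mpirun command line. For the warning shown above, it prints enp189s0f0 tcp, enp1s0 tcp, and mlx5_0:1 ib on separate lines.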

Procedure

  1. Use PuTTY to log in to a job execution node as a common Hyper MPI user, for example, hmpi_user.
  2. Run the following command to query the names of all available NICs on the current job execution node:

    ibdev2netdev

    Example output:

    mlx5_0 port 1 ==> enp1s0 (Up)
  3. Run the mpirun command again, replacing the incorrect NIC name with an available one, for example, mlx5_0:1.

    mpirun -np 16 -N 2 --hostfile ~/hmpifile_2021/hostfile/hf8 -x UCX_NET_DEVICES=mlx5_0:1 ~/hmpifile_2021/allreduce/AllReduce

    • ~/hmpifile_2021/hostfile is the directory that contains the hostfile. Replace it as required.
    • hf8 is the hostfile that lists the job execution nodes. Change the file name as required.
    • ~/hmpifile_2021/allreduce is the directory that contains the job to run.
    • AllReduce is the job to run. Change it as required.
    • mlx5_0:1 is the NIC name obtained in step 2, written as device:port (mlx5_0 port 1 in the ibdev2netdev output).
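
The conversion from the ibdev2netdev output in step 2 to the name used in step 3 can be sketched as a small shell helper. This is only an illustration, assuming the output has the layout shown in step 2 (device, the word port, port number, ==>, interface, state); the function name ibdev_to_ucx is hypothetical.

```shell
# Hypothetical helper: read `ibdev2netdev` output on stdin and print one
# UCX-style name (device:port) for every port whose state is (Up).
ibdev_to_ucx() {
  awk '$2 == "port" && $NF == "(Up)" { print $1 ":" $3 }'
}

# Usage on a job execution node where ibdev2netdev is installed:
#   ibdev2netdev | ibdev_to_ucx
```

For the output shown in step 2, the helper prints mlx5_0:1, which is the value to assign to UCX_NET_DEVICES. Ports that are not in the Up state are skipped.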