Specified Network Type Error

Symptom

  • The network type specified when the MPI job is submitted is incorrect, so the mpirun command fails.
    In the following example, the tcp transport is requested on device mlx5_0:1, which does not provide it:
    $ mpirun  -np 16  -N 2  --hostfile ~/hmpifile_2021/hostfile/hf8 -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=tcp ~/hmpifile_2021/allreduce/AllReduce
    [1632384264.377328] [arm-node132:2637768:0]    ucp_context.c:1073 UCX  ERROR no usable transports/devices (asked tcp on network:mlx5_0:1 )
    [arm-node132:2637768] *** An error occurred in MPI_Allreduce
    [arm-node132:2637768] *** reported by process [2539913217,0]
    [arm-node132:2637768] *** on communicator MPI_COMM_WORLD
    [arm-node132:2637768] *** MPI_ERR_INTERN: internal error
    [arm-node132:2637768] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
    [arm-node132:2637768] ***    and potentially your MPI job)
    [arm-node132:2637714] 7 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
    [arm-node132:2637714] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
  • The transport specified when the MPI job is submitted is not supported, and the following warning is reported:
    UCX  WARN  transport 'rc_x' is not available, please use one or more of: cma, ib, mm, posix, rc, rc_v, rc_verbs, self, shm, sm, sysv, tcp, ud, ud_v, ud_verbs

Possible Causes

An invalid value is specified for the -x UCX_TLS parameter in the mpirun command. Different network devices support different transports, so the UCX_TLS value must name a transport that the device selected with UCX_NET_DEVICES provides.
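The second symptom can be checked before submission by comparing the requested UCX_TLS value with the transport names listed in the warning. The following is a minimal sketch, not part of Hyper MPI or UCX: the helper name unsupported_tls is hypothetical, the list of names is copied from the warning shown above, and the sketch treats UCX_TLS as a plain comma-separated list, ignoring any other syntax that UCX may accept.

```shell
# Transport names copied from the UCX warning in the Symptom section.
available="cma ib mm posix rc rc_v rc_verbs self shm sm sysv tcp ud ud_v ud_verbs"

# Hypothetical helper: print each requested transport that is not in the list.
#   $1 = comma-separated UCX_TLS value
#   $2 = space-separated list of available transport names
unsupported_tls() {
    for t in $(printf '%s' "$1" | tr ',' ' '); do
        case " $2 " in
            *" $t "*) ;;                  # name is in the list: nothing to report
            *) printf '%s\n' "$t" ;;      # name is missing: report it
        esac
    done
}

# Report the unsupported names in an example UCX_TLS value.
unsupported_tls "rc_x,sm,self" "$available"
```

If the function prints nothing, every requested name appears in the list. On a real node, the list should be taken from that node's own warning or ucx_info output rather than from this example.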

Procedure

  1. Use PuTTY to log in to a job execution node as a common Hyper MPI user, for example, hmpi_user.
  2. Run the following command to query the NIC types (devices) and network protocols (transports) available on the current job execution node:

    ucx_info -d

    The command output contains entries similar to the following:

    Transport: rc_mlx5
    Device: mlx5_0:1
  3. Run the mpirun command again with UCX_TLS set to an available network protocol:

    mpirun -np 16 -N 2 --hostfile ~/hmpifile_2021/hostfile/hf8 -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_TLS=rc_mlx5 ~/hmpifile_2021/allreduce/AllReduce

    • ~/hmpifile_2021/hostfile is the directory that contains the hostfile listing the job execution nodes. Replace it as required.
    • hf8 is the hostfile that lists the job execution nodes. Change the file name as required.
    • ~/hmpifile_2021/allreduce is the directory that contains the job to be run.
    • AllReduce is the job to be run. Change it as required.
    • mlx5_0:1 is the NIC type (the Device value) queried in step 2.
    • rc_mlx5 is the network protocol (the Transport value) queried in step 2.
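The check in steps 2 and 3 can also be scripted. The following is a minimal sketch, assuming the ucx_info -d output contains "Transport:" lines followed by "Device:" lines as in the example in step 2. The helper names list_ucx_pairs and check_ucx_pair are hypothetical and are not part of UCX or Hyper MPI.

```shell
# Hypothetical helper: read `ucx_info -d`-style text on stdin and print
# one "transport device" pair per line, without duplicates.
list_ucx_pairs() {
    awk '
        /Transport:/ { transport = $NF }                  # remember the latest transport
        /Device:/ && transport != "" { print transport, $NF }
    ' | sort -u
}

# Hypothetical helper: succeed only if the given pair is present on stdin.
#   $1 = transport (for example rc_mlx5)
#   $2 = device    (for example mlx5_0:1)
check_ucx_pair() {
    list_ucx_pairs | grep -xF "$1 $2" > /dev/null
}

# Demonstration with the sample output from step 2.
sample='Transport: rc_mlx5
Device: mlx5_0:1'
printf '%s\n' "$sample" | list_ucx_pairs
```

On a job execution node, the same check could be run against live output, for example: ucx_info -d | check_ucx_pair rc_mlx5 mlx5_0:1 && echo "pair is available". A failing check before submission suggests that the UCX_TLS and UCX_NET_DEVICES values do not match.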