mpirun Malfunctions When Running on Multiple Nodes

Symptom

  • When you run the mpirun command on multiple nodes, the system does not respond, and the top command shows no mpirun process.
  • When you run the mpirun command on multiple nodes, the following error information is displayed:
    [1632387881.405868] [arm-node88:57923:0]       mm_posix.c:194  UCX  ERROR shm_open(file_name=/ucx_shm_posix_23f3f65f flags=0xc2) failed: Permission denied
    [1632387881.405910] [arm-node88:57923:0]        uct_mem.c:132  UCX  ERROR failed to allocate 8447 bytes using md posix for mm_recv_fifo: Shared memory error
    [1632387881.405917] [arm-node88:57923:0]       mm_iface.c:605  UCX  ERROR mm_iface failed to allocate receive FIFO
    [arm-node88:57923] coll_ucx_component.c:360  Warning: Failed to create UCG worker, automatically select other available and highest priority collective component.
    [1632387881.411347] [arm-node88:57923:0]       mm_posix.c:194  UCX  ERROR shm_open(file_name=/ucx_shm_posix_6ae5143e flags=0xc2) failed: Permission denied
    [1632387881.411359] [arm-node88:57923:0]        uct_mem.c:132  UCX  ERROR failed to allocate 8447 bytes using md posix for mm_recv_fifo: Shared memory error
    [1632387881.411366] [arm-node88:57923:0]       mm_iface.c:605  UCX  ERROR mm_iface failed to allocate receive FIFO
    [arm-node88:57923] pml_ucx.c:274  Error: Failed to create UCP worker
    [arm-node88:57923] *** An error occurred in MPI_Allreduce
    [arm-node88:57923] *** reported by process [878510081,70368744177671]
    [arm-node88:57923] *** on communicator MPI_COMM_WORLD
    [arm-node88:57923] *** MPI_ERR_INTERN: internal error
    [arm-node88:57923] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
    [arm-node88:57923] ***    and potentially your MPI job)

Possible Causes

When you run the mpirun command on multiple nodes, some nodes cannot communicate with other nodes.
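
The cross-node communication described above can be spot-checked with a minimal shell sketch. It assumes passwordless SSH between nodes and a hostfile listing one node per line; the hostfile path and the check_nodes helper name are illustrative, not part of Hyper MPI:

```shell
# Sketch: verify that every node in a hostfile is reachable over
# passwordless SSH. A "FAIL" line indicates a node that cannot
# communicate, which matches the possible cause above.
check_nodes() {
  hostfile="$1"
  failed=0
  while IFS= read -r line; do
    host="${line%% *}"                        # keep only the hostname field
    case "$host" in ''|\#*) continue ;; esac  # skip blank lines and comments
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
      echo "OK:   $host"
    else
      echo "FAIL: $host"
      failed=1
    fi
  done < "$hostfile"
  return "$failed"
}
# Example: check_nodes /path/to/hostfile
```

BatchMode=yes makes ssh fail immediately instead of prompting for a password, so the loop also catches nodes where passwordless SSH is not configured.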

Procedure

  1. Use PuTTY to log in to a job execution node as a common Hyper MPI user, for example, hmpi_user.
  2. Check whether Hyper MPI is installed in a shared directory that is mounted on all job execution nodes. You are advised to install Hyper MPI in a mounted shared directory so that every node uses the same installation.
  3. Check whether environment variables are correctly configured. For details, see Configuring Environment Variables.
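
Steps 2 and 3 can be spot-checked on each node with a short shell sketch. The check_env helper name is hypothetical, and the variables shown are common MPI settings, not an exhaustive list from Configuring Environment Variables:

```shell
# Sketch: confirm that mpirun resolves on PATH (ideally to the shared
# Hyper MPI installation directory) and show the library search path.
check_env() {
  if command -v mpirun >/dev/null 2>&1; then
    echo "mpirun: $(command -v mpirun)"
  else
    echo "mpirun: NOT FOUND on PATH; source the Hyper MPI environment first"
  fi
  echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-<unset>}"
}
check_env
```

Run this on every job execution node; the mpirun path reported should be identical on all of them when Hyper MPI is installed in a mounted shared directory.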