mpirun Malfunctions When Running on Multiple Nodes
Symptom
- When you run the mpirun command on multiple nodes, the system does not respond, and the top command shows no mpirun process.
- When you run the mpirun command on multiple nodes, the following error information is displayed:
[1632387881.405868] [arm-node88:57923:0] mm_posix.c:194 UCX ERROR shm_open(file_name=/ucx_shm_posix_23f3f65f flags=0xc2) failed: Permission denied
[1632387881.405910] [arm-node88:57923:0] uct_mem.c:132 UCX ERROR failed to allocate 8447 bytes using md posix for mm_recv_fifo: Shared memory error
[1632387881.405917] [arm-node88:57923:0] mm_iface.c:605 UCX ERROR mm_iface failed to allocate receive FIFO
[arm-node88:57923] coll_ucx_component.c:360 Warning: Failed to create UCG worker, automatically select other available and highest priority collective component.
[1632387881.411347] [arm-node88:57923:0] mm_posix.c:194 UCX ERROR shm_open(file_name=/ucx_shm_posix_6ae5143e flags=0xc2) failed: Permission denied
[1632387881.411359] [arm-node88:57923:0] uct_mem.c:132 UCX ERROR failed to allocate 8447 bytes using md posix for mm_recv_fifo: Shared memory error
[1632387881.411366] [arm-node88:57923:0] mm_iface.c:605 UCX ERROR mm_iface failed to allocate receive FIFO
[arm-node88:57923] pml_ucx.c:274 Error: Failed to create UCP worker
[arm-node88:57923] *** An error occurred in MPI_Allreduce
[arm-node88:57923] *** reported by process [878510081,70368744177671]
[arm-node88:57923] *** on communicator MPI_COMM_WORLD
[arm-node88:57923] *** MPI_ERR_INTERN: internal error
[arm-node88:57923] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[arm-node88:57923] *** and potentially your MPI job)
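The shm_open "Permission denied" messages in the log indicate that UCX could not create its shared-memory receive FIFO. A minimal sketch for checking shared memory on a node, run as the same user that launches mpirun (the test file name ucx_test_$$ is only an illustration):

```shell
# Check that POSIX shared memory (/dev/shm) is present and writable.
# UCX creates its receive FIFOs there, so a missing, full, or
# non-writable /dev/shm produces shm_open errors like those above.
ls -ld /dev/shm        # expect mode drwxrwxrwt (world-writable, sticky bit)
df -h /dev/shm         # expect free space for the FIFOs

# Try to create a file the way shm_open would (the name is arbitrary):
tmp=/dev/shm/ucx_test_$$
if touch "$tmp" 2>/dev/null; then
    echo "shared memory writable"
    rm -f "$tmp"
else
    echo "shared memory NOT writable" >&2
fi
```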
Possible Causes
When you run the mpirun command on multiple nodes, some nodes cannot communicate with other nodes.
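One quick way to confirm this cause is to check that every node used by the job can be reached non-interactively over SSH. A hedged sketch (node1 and node2 are placeholder hostnames; substitute the nodes from your hostfile):

```shell
# Verify passwordless, non-interactive SSH to every compute node.
# "node1 node2" are placeholder hostnames for this sketch.
for host in node1 node2; do
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
        echo "$host: reachable"
    else
        echo "$host: UNREACHABLE" >&2
    fi
done
```

BatchMode=yes makes ssh fail immediately instead of prompting for a password, which is the same non-interactive behavior mpirun relies on when it spawns remote processes.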
Procedure
- Use PuTTY to log in to a job execution node as a common Hyper MPI user, for example, hmpi_user.
- Ensure that Hyper MPI is installed in a location accessible to all nodes. You are advised to install Hyper MPI in a mounted shared directory.
- Check whether environment variables are correctly configured. For details, see Configuring Environment Variables.
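As a sketch, every node typically needs the Hyper MPI binaries and libraries on its search paths. The install prefix /opt/hmpi below is only an assumption; substitute your actual installation directory and see Configuring Environment Variables for the authoritative list:

```shell
# Hypothetical install prefix; replace with your actual Hyper MPI path.
HMPI_HOME=/opt/hmpi
export PATH="$HMPI_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$HMPI_HOME/lib:$LD_LIBRARY_PATH"

# Confirm that every node resolves mpirun to the same shared install:
command -v mpirun || echo "mpirun not on PATH" >&2
```

Running the check on each node (for example, over SSH) helps catch nodes where the variables are set only in an interactive shell profile and therefore missing in non-interactive mpirun sessions.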
Parent topic: FAQ