Specified Host Name Error
Symptom
A host name in the hostfile passed to mpirun at job submission is incorrect, so mpirun fails to start the job.
The following is an example of the execution failure:
$ mpirun -np 16 -N 2 --hostfile ~/hmpifile_2021/hostfile/hf8 -x UCX_NET_DEVICES=mlx5_0:1 /home/hmpi_user/hmpifile_2021/allreduce/AllReduce
ssh: Could not resolve hostname arm-node056: Name or service not known
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on one or more nodes. Please check your PATH and LD_LIBRARY_PATH settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes. Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base). Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required (e.g., on Cray). Please check your configure cmd line and consider using one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a lack of common network interfaces and/or no route found between them. Please check network connectivity (including firewalls and network routing requirements).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon located on the indicated node:
my node: arm-node056
target node: arm-node011
This is usually an internal programming error that should be reported to the developers. In the meantime, a workaround may be to set the MCA param routed=direct on the command line or in your environment. We apologize for the problem.
--------------------------------------------------------------------------
[arm-node132:2640389] 5 more processes have sent help message help-errmgr-base.txt / no-path
[arm-node132:2640389] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
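In a long log, the ssh line that names the bad host is easy to miss among the ORTE help text. The following is a minimal sketch for pulling those host names out of saved mpirun output. The function name unresolved_from_log is hypothetical, and the pattern assumes the ssh message has the same wording as in the example above.

```shell
# Print each host name that ssh reported as unresolvable, once per name.
# Reads mpirun output on standard input.
# Assumption: the message reads "Could not resolve hostname <name>: ...",
# as in the failure example above.
unresolved_from_log() {
    sed -n 's/.*Could not resolve hostname \([^: ]*\):.*/\1/p' | sort -u
}
```

For example, if the output was saved to a file named mpirun.log (an assumed name), run `unresolved_from_log < mpirun.log`.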
Possible Causes
The hf8 hostfile contains a host name (arm-node056 in this example) that cannot be resolved on the LAN when the mpirun command is run, so ssh cannot reach that node to start the job.
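To confirm this cause before editing anything, you can check every entry in the hostfile against the local resolver. The sketch below is one possible way to do that, not part of Hyper MPI. It assumes a Linux node where getent is available and a hostfile with one host per line, optional fields such as slots=N after the host name, and # for comments. The function name unresolved_hosts is hypothetical.

```shell
# List hostfile entries that the local resolver cannot look up.
# $1: path to the hostfile.
# Assumptions: one host per line, the host name is the first field,
# "#" starts a comment, and getent is installed on this node.
unresolved_hosts() {
    awk '{ sub(/#.*/, "") } NF { print $1 }' "$1" | sort -u |
    while read -r host; do
        # getent exits with a non-zero status when the name is unknown.
        getent hosts "$host" >/dev/null 2>&1 || echo "$host"
    done
}
```

Running `unresolved_hosts ~/hmpifile_2021/hostfile/hf8` prints the names that fail to resolve and prints nothing if every entry resolves. Note that getent may not follow exactly the same lookup path as ssh, so treat the result as a hint.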
Procedure
- Use PuTTY to log in to a job execution node as a regular Hyper MPI user, for example, hmpi_user.
- Modify the hf8 file.
  - Open the hf8 file.
    vi ~/hmpifile_2021/hostfile/hf8
    - ~/hmpifile_2021/hostfile is the directory that contains the hostfile. Replace it with your actual path.
    - hf8 is the name of the hostfile that lists the job execution nodes. Replace it with your actual file name.
  - Press i to enter insert mode and, based on the error message, change arm-node056 to a host name that exists on the LAN (arm-node134 in this example).
    arm-node134
  - Press Esc, type :wq!, and press Enter to save the file and exit.
- Run the job again to check that the hf8 file was modified correctly:
  mpirun -np 16 -N 2 --hostfile ~/hmpifile_2021/hostfile/hf8 -x UCX_NET_DEVICES=mlx5_0:1 ~/hmpifile_2021/allreduce/AllReduce
  - ~/hmpifile_2021/allreduce is the directory that contains the job to run. Replace it with your actual path.
  - AllReduce is the job to run. Replace it with your actual job as required.
  If the following output is displayed, the hf8 file was modified correctly:
  All tests are success
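If you prefer not to edit the hostfile interactively, the same change can be scripted. The following is a minimal sketch under the same hostfile-format assumption (host name as the first field of each line). The function name replace_host is hypothetical, and the script keeps a backup copy with a .bak suffix.

```shell
# Replace a host name in a hostfile without opening an editor.
# $1: hostfile path, $2: incorrect host name, $3: correct host name.
# Only lines whose first field equals the incorrect name are changed;
# other fields on the line (for example "slots=2") are kept.
# The original file is saved as <hostfile>.bak.
replace_host() {
    cp -- "$1" "$1.bak" &&
    awk -v old="$2" -v new="$3" '$1 == old { $1 = new } { print }' "$1.bak" > "$1"
}
```

For the example in this topic, the call would be `replace_host ~/hmpifile_2021/hostfile/hf8 arm-node056 arm-node134`, followed by the mpirun command from the procedure to rerun the job.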