Specified Host Name Error

Symptom

A host name in the hostfile passed to mpirun during MPI job submission is incorrect. As a result, the mpirun command fails.

The following is an example of the execution failure:

$ mpirun -np 16 -N 2 --hostfile ~/hmpifile_2021/hostfile/hf8 -x UCX_NET_DEVICES=mlx5_0:1 /home/hmpi_user/hmpifile_2021/allreduce/AllReduce
ssh: Could not resolve hostname arm-node056: Name or service not known
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
 
* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default
 
* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.
 
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.
 
*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.
 
* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:
 
  my node:   arm-node056
  target node:  arm-node011
 
This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------
[arm-node132:2640389] 5 more processes have sent help message help-errmgr-base.txt / no-path
[arm-node132:2640389] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Possible Causes

The hf8 hostfile contains a host name (arm-node056 in this example) that cannot be resolved on the LAN when the mpirun command is run.
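To find every unresolvable entry before rerunning the job, a check along the following lines may help. It is a minimal sketch, not part of Hyper MPI. The names find_unresolved_hosts and resolve_host are hypothetical helpers. It assumes that the hostfile lists one host per line, optionally followed by options such as slots=2, and that # starts a comment. It also assumes that getent is available on the node and consults the same name sources that ssh uses.

```shell
# resolve_host (hypothetical helper): succeeds if the system resolver
# (/etc/hosts, DNS, and so on) knows the host name passed as $1.
resolve_host() {
    getent hosts "$1" > /dev/null 2>&1
}

# find_unresolved_hosts (hypothetical helper): prints each host name in the
# hostfile ($1) that does not resolve. Prints nothing if every entry resolves.
find_unresolved_hosts() {
    # awk strips comments, skips blank lines, and keeps the first field,
    # which is assumed to be the host name.
    awk '{ sub(/#.*/, ""); if (NF) print $1 }' "$1" |
    while read -r host; do
        resolve_host "$host" || echo "$host"
    done
}
```

For example, running find_unresolved_hosts ~/hmpifile_2021/hostfile/hf8 on the hostfile from the symptom above would be expected to print arm-node056.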

Procedure

  1. Use PuTTY to log in to a job execution node as a regular Hyper MPI user (for example, hmpi_user).
  2. Modify the hf8 file.
    1. Open the hf8 file.

      vi ~/hmpifile_2021/hostfile/hf8

      • ~/hmpifile_2021/hostfile is the directory that contains the hostfile. Replace it with your own path as required.
      • hf8 is the name of the hostfile that lists the job execution nodes. Replace it with your own file name as required.
    2. Press i to enter insert mode and replace arm-node056, the host name reported in the error message, with a host name that exists on the LAN, for example, arm-node134:
      arm-node134
    3. Press Esc, type :wq!, and press Enter to save the file and exit.
  3. Rerun the job to check whether the hf8 file has been modified correctly:

    mpirun -np 16 -N 2 --hostfile ~/hmpifile_2021/hostfile/hf8 -x UCX_NET_DEVICES=mlx5_0:1 ~/hmpifile_2021/allreduce/AllReduce

    • ~/hmpifile_2021/allreduce is the directory that contains the job executable.
    • AllReduce is the job executable. Replace it as required.

    If the following output is displayed, the job ran successfully and the hf8 file was modified correctly:

    All tests are success
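The edit in step 2 can also be made without opening vi. The following is a minimal sketch under the same assumption that the host name is the first field on its line. The function name replace_host is hypothetical. It keeps a backup copy with the suffix .bak, and any run of spaces or tabs on the changed line is collapsed to single spaces.

```shell
# replace_host (hypothetical helper): in the hostfile ($1), replace the host
# name $2 with $3 wherever it is the first field of a line.
# The original file is kept as <hostfile>.bak.
replace_host() {
    hostfile=$1
    old=$2
    new=$3
    cp "$hostfile" "$hostfile.bak" &&
    awk -v old="$old" -v new="$new" \
        '$1 == old { $1 = new } { print }' "$hostfile.bak" > "$hostfile"
}
```

For the example in this section, the call would be replace_host ~/hmpifile_2021/hostfile/hf8 arm-node056 arm-node134, followed by the mpirun command in step 3.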