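Incorrect Host Name Specified: "Name or service not known" Error

Symptom

When an MPI job is submitted, the mpirun command fails because a host name specified in the hostfile is incorrect.

The following is an example of the failure: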

mpirun -np 16 -N 2 --hostfile ~/hmpifile_2021/hostfile/hf8 -x UCX_NET_DEVICES=mlx5_0:1 /home/hmpi_user/hmpifile_2021/allreduce/AllReduce

ssh: Could not resolve hostname arm-node056: Name or service not known 
-------------------------------------------------------------------------- 
ORTE was unable to reliably start one or more daemons. 
This usually is caused by: 
  
* not finding the required libraries and/or binaries on 
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH 
  settings, or configure OMPI with --enable-orterun-prefix-by-default 
  
* lack of authority to execute on one or more specified nodes. 
  Please verify your allocation and authorities. 
  
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base). 
  Please check with your sys admin to determine the correct location to use. 
  
*  compilation of the orted with dynamic libraries when static are required 
  (e.g., on Cray). Please check your configure cmd line and consider using 
  one of the contrib/platform definitions for your system type. 
  
* an inability to create a connection back to mpirun due to a 
  lack of common network interfaces and/or no route found between 
  them. Please check network connectivity (including firewalls 
  and network routing requirements). 
-------------------------------------------------------------------------- 
-------------------------------------------------------------------------- 
ORTE does not know how to route a message to the specified daemon 
located on the indicated node: 
  
  my node:   arm-node056 
  target node:  arm-node011 
  
This is usually an internal programming error that should be 
reported to the developers. In the meantime, a workaround may 
be to set the MCA param routed=direct on the command line or 
in your environment. We apologize for the problem. 
-------------------------------------------------------------------------- 
[arm-node132:2640389] 5 more processes have sent help message help-errmgr-base.txt / no-path 
[arm-node132:2640389] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

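Possible Cause

A host name specified in the hf8 file passed to the mpirun command does not exist on the LAN (in the example above, arm-node056 cannot be resolved).

Recovery Procedure

  1. Use PuTTY to log in to the job execution node as a common Hyper MPI user (for example, "hmpi_user").
  2. Perform the following steps to modify the "hf8" file.
    1. Open the "hf8" file.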
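Before editing the hostfile, it can help to list every entry that the system resolver does not know. The following is a minimal sketch, not part of Hyper MPI: `list_hosts`, `unresolved_hosts`, and the `RESOLVE_CMD` override are hypothetical names introduced here, and the sketch assumes an Open MPI style hostfile (one host per line, optionally followed by fields such as `slots=N`, with `#` starting a comment).

```shell
#!/bin/sh
# list_hosts FILE
# Print the first field of every non-blank, non-comment line.
# Assumption: the host name is the first field on its line.
list_hosts() {
    awk '{ sub(/#.*/, "") } NF { print $1 }' "$1"
}

# unresolved_hosts FILE
# Print each host name that the resolver command fails on.
# RESOLVE_CMD is a hypothetical override; by default the glibc
# "getent ahosts" lookup is used.
unresolved_hosts() {
    for host in $(list_hosts "$1"); do
        ${RESOLVE_CMD:-getent ahosts} "$host" >/dev/null 2>&1 || echo "$host"
    done
}
```

Running `unresolved_hosts ~/hmpifile_2021/hostfile/hf8` prints one line per host name that fails to resolve; empty output means every entry resolved on the node where the check was run.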

      vi ~/hmpifile_2021/hostfile/hf8

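    • ~/hmpifile_2021/hostfile: path of the directory containing the file that specifies the nodes on which the job runs.
    • hf8: file that specifies the nodes on which the job runs.
    2. Press "i" to enter insert mode and, based on the error message, change "arm-node056" to "arm-node134" (a host name that exists on the LAN).
      arm-node134
    3. Press "Esc", type :wq!, and press "Enter" to save the file and exit.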
  3. Run the following command to verify that the "hf8" file has been modified correctly.

    mpirun -np 16 -N 2 --hostfile ~/hmpifile_2021/hostfile/hf8 -x UCX_NET_DEVICES=mlx5_0:1 ~/hmpifile_2021/allreduce/AllReduce

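    • ~/hmpifile_2021/allreduce: path of the directory containing the job to run.
    • AllReduce: job to run.

    If the following output is displayed, the "hf8" file has been modified successfully.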

    All tests are success
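The vi edit in step 2 can also be done non-interactively, which may be convenient when the same correction has to be applied to several hostfiles. The following is a rough sketch under stated assumptions: `replace_host` is a hypothetical helper (not a Hyper MPI command), the host name is assumed to be the first field on its line, and a copy of the original file is kept with a `.bak` suffix.

```shell
#!/bin/sh
# replace_host FILE OLD NEW
# Replace host name OLD with NEW on every line whose first field is
# exactly OLD. Other fields (for example slots=N) are left unchanged.
replace_host() {
    file=$1 old=$2 new=$3
    # Keep a backup so the change can be reverted.
    cp -p "$file" "$file.bak" || return 1
    awk -v old="$old" -v new="$new" '
        $1 == old {
            # Plain string replacement of the first occurrence,
            # which is the first field because $1 == old.
            i = index($0, old)
            $0 = substr($0, 1, i - 1) new substr($0, i + length(old))
        }
        { print }
    ' "$file.bak" > "$file"
}
```

For the case described here, the call would be `replace_host ~/hmpifile_2021/hostfile/hf8 arm-node056 arm-node134`, followed by the mpirun command from step 3 to verify the result.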