
UD Timeout Error Caused by Packet Loss or Other Network Problems: UD endpoint ...... unhandled timeout error

Symptom

When an MPI job runs, the following error is reported:

[arm-129:435170:0:435170]       ud_ep.c:262  Fatal: UD endpoint 0x6c29690 to <no debug data>: unhandled timeout error
==== backtrace (tid: 435170) ====
 0  /workspace/cw/ccsuite/hmpi/install/hucx/lib/libucs.so.0(ucs_handle_error+0x250) [0x4000237b3630]
 1  /workspace/cw/ccsuite/hmpi/install/hucx/lib/libucs.so.0(ucs_fatal_error_message+0xd0) [0x4000237b0940]
 2  /workspace/cw/ccsuite/hmpi/install/hucx/lib/libucs.so.0(+0x1fa08) [0x4000237b0a08]
 3  /workspace/cw/ccsuite/hmpi/install/hucx/lib/ucx/libuct_ib.so.0(+0x475a8) [0x4000238795a8]
 4  /workspace/cw/ccsuite/hmpi/install/hucx/lib/libucs.so.0(+0x18d10) [0x4000237a9d10]
 5  /workspace/cw/ccsuite/hmpi/install/hucx/lib/libucp.so.0(ucp_worker_progress+0x60) [0x4000236f47b0]
 6  /workspace/cw/ccsuite/hmpi/install/hmpi/lib/libopen-pal.so.40(opal_progress+0x38) [0x4000223cbef8]
 7  /workspace/cw/ccsuite/hmpi/install/hmpi/lib/libmpi.so.40(ompi_mpi_init+0xc78) [0x4000220be608]
 8  /workspace/cw/ccsuite/hmpi/install/hmpi/lib/libmpi.so.40(MPI_Init+0x64) [0x400022066404]
 9  /workspace/cw/cwScript/mpijob/bcast_sleep_accurate() [0x400a6c]
10  /usr/lib64/libc.so.6(+0x2afbc) [0x400022146fbc]
11  /usr/lib64/libc.so.6(__libc_start_main+0x94) [0x400022147094]
12  /workspace/cw/cwScript/mpijob/bcast_sleep_accurate() [0x400930]
=================================
[arm-129:435170] *** Process received signal ***
[arm-129:435170] Signal: Aborted (6)
[arm-129:435170] Signal code:  (-6)
[arm-129:435170] [ 0] linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0x400021fd393c]
[arm-129:435170] [ 1] /usr/lib64/libc.so.6(+0x80e78)[0x40002219ce78]
[arm-129:435170] [ 2] /usr/lib64/libc.so.6(raise+0x1c)[0x400022158cfc]
[arm-129:435170] [ 3] /usr/lib64/libc.so.6(abort+0xe0)[0x400022146d2c]
[arm-129:435170] [ 4] /workspace/cw/ccsuite/hmpi/install/hucx/lib/libucs.so.0(+0x1f944)[0x4000237b0944]
[arm-129:435170] [ 5] /workspace/cw/ccsuite/hmpi/install/hucx/lib/libucs.so.0(+0x1fa08)[0x4000237b0a08]
[arm-129:435170] [ 6] /workspace/cw/ccsuite/hmpi/install/hucx/lib/ucx/libuct_ib.so.0(+0x475a8)[0x4000238795a8]
[arm-129:435170] [ 7] /workspace/cw/ccsuite/hmpi/install/hucx/lib/libucs.so.0(+0x18d10)[0x4000237a9d10]
[arm-129:435170] [ 8] /workspace/cw/ccsuite/hmpi/install/hucx/lib/libucp.so.0(ucp_worker_progress+0x60)[0x4000236f47b0]
[arm-129:435170] [ 9] /workspace/cw/ccsuite/hmpi/install/hmpi/lib/libopen-pal.so.40(opal_progress+0x38)[0x4000223cbef8]
[arm-129:435170] [10] /workspace/cw/ccsuite/hmpi/install/hmpi/lib/libmpi.so.40(ompi_mpi_init+0xc78)[0x4000220be608]
[arm-129:435170] [11] /workspace/cw/ccsuite/hmpi/install/hmpi/lib/libmpi.so.40(MPI_Init+0x64)[0x400022066404]
[arm-129:435170] [12] /workspace/cw/cwScript/mpijob/bcast_sleep_accurate[0x400a6c]
[arm-129:435170] [13] /usr/lib64/libc.so.6(+0x2afbc)[0x400022146fbc]
[arm-129:435170] [14] /usr/lib64/libc.so.6(__libc_start_main+0x94)[0x400022147094]
[arm-129:435170] [15] /workspace/cw/cwScript/mpijob/bcast_sleep_accurate[0x400930]
[arm-129:435170] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 5 with PID 435171 on node arm-129 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

Possible Causes

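  • The HUCX source package in use is not a patched version. With UCX_TLS=ud specified through the scheduler, suspending the MPI job for a while and then resuming it triggers the timeout error.
  • The job runs over a RoCE network, but lossless networking is not configured on the NIC side and the switch side.
  • The network links between the compute nodes are faulty.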

Recovery Procedure

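  1. The latest patch package for the corresponding HUCX version already supports suspending and resuming jobs. Download the patch package for your version, then recompile and reinstall HUCX.
  2. If lossless networking is not configured on the NIC side and the switch side, configure it before running the job again.
  3. If the lossless network configuration is correct, check whether the network links between the failing nodes are faulty.
  4. If no problem is found in the physical links or the hardware configuration, set -x UCX_UD_MLX5_TIMER_BACKOFF=1 -x UCX_UD_MLX5_TIMER_TICK=100ms -x UCX_UD_MLX5_TIMEOUT=600s to increase the timeout and work around the problem temporarily.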
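
The workaround in the last step can be sketched as a small launch script. This is a minimal sketch rather than a definitive command line: the three -x options are the ones listed in that step, while the process count (NP) and the application path (APP) are hypothetical placeholders chosen for illustration. The script only prints the assembled command unless RUN=1 is set, so it can be reviewed before a job is actually launched.

```shell
#!/bin/sh
# Temporary workaround: lengthen the UD timeout through UCX variables
# exported to all ranks with mpirun's -x option.
# NP and APP are hypothetical placeholders; replace them with your own values.
NP="${NP:-8}"
APP="${APP:-./bcast_sleep_accurate}"

# Timeout settings taken from the recovery step above.
UCX_OPTS="-x UCX_UD_MLX5_TIMER_BACKOFF=1 -x UCX_UD_MLX5_TIMER_TICK=100ms -x UCX_UD_MLX5_TIMEOUT=600s"

CMD="mpirun -np $NP $UCX_OPTS $APP"

# Print the command for review; set RUN=1 to actually launch the job.
echo "$CMD"
if [ "${RUN:-0}" = "1" ]; then
    # Intentionally unquoted so the string splits into separate arguments.
    $CMD
fi
```

For example, `RUN=1 NP=16 APP=./my_app sh launch.sh` would launch the job with the longer timeout (launch.sh and ./my_app are assumed names). Because this only delays the timeout rather than removing its cause, the settings should be dropped once the patched HUCX package is installed or the network problem is fixed.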