Timeout Error Caused by a Missing ARP Entry: ibv_create_ah...failed: Connection timed out
Symptom
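When a user submits a large-scale MPI job, connection setup is very likely to time out. The job output log contains a fatal "failed to get peer address" error followed by a backtrace: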
[agent363:373418:0:373418] ud_iface.c:49 Fatal: iface 0x1ddfcb30: failed to get peer address
=== backtrace (tid: 373418) ====
 0 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hucx/lib/libucs.so.0(ucs_fatal_error_message+0x38) [0x40012c420128]
 1 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hucx/lib/libucs.so.0(+0x2025c) [0x40012c42025c]
 2 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hucx/lib/ucx/libuct_ib.so.0(uct_ud_iface_cep_insert_ep+0) [0x40012c5157e0]
 3 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hucx/lib/ucx/libuct_ib.so.0(uct_ud_ep_create_connected_common+0xd4) [0x40012c5184c4]
 4 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hucx/lib/libucp.so.0(ucp_wireup_ep_connect_aux+0xc0) [0x400127f51be0]
 5 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hucx/lib/libucp.so.0(ucp_wireup_ep_connect+0xe4) [0x400127f522e4]
 6 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hucx/lib/libucp.so.0(ucp_wireup_init_lanes+0x8d4) [0x400127f53e94]
 7 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hucx/lib/libucp.so.0(ucp_ep_create_to_worker_addr+0x78) [0x400127f1cf58]
 8 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hucx/lib/libucp.so.0(ucp_ep_create+0x4b0) [0x400127f1dbbe]
 9 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hmpi/lib/openmpi/mca_pml_ucx.so(+0x5940) [0x400127e55940]
10 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hmpi/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_send+0x198) [0x400127e54304]
11 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hmpi/lib/libmpi.so.40(ompi_coll_base_sendrecv_actual+0xbc) [0x4001266ffafc]
12 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hmpi/lib/libmpi.so.40(ompi_coll_base_sendrecv_intra_bruck+0xac) [0x4001266fe710]
13 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hmpi/lib/openmpi/mca_coll_ucx.so(+0x5b08) [0x40012cb25b08]
14 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hmpi/lib/libmpi.so.40(mca_coll_base_comm_select+0x880) [0x4001266f3d64]
15 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hmpi/lib/libmpi.so.40(ompi_mpi_init+0xe30) [0x40012672a7a4]
16 /share/software/mpi/hmpi/1.2.1/bisheng2.5.0/hmpi/lib/libmpi.so.40(MPI_Init+0xa8) [0x4001266d8468]
17 hello_mpi() [0x4008c0]
18 /usr/lib64/libc.so.6(__libc_start_main+0xe0) [0x400126813f40]
19 hello_mpi() [0x4007dc]
================================
The job error log contains "Connection timed out" errors reported by ibv_create_ah:
[1693797506.380997] [agent289:3924149:0] ib_device.c:1252 UCX ERROR ibv_create_ah(dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:192.168.1.115 sgid_index=5 traffic_class=106) on mlx5_0 failed: Connection timed out
[1693797506.384196] [agent276:1958414:0] ib_device.c:1252 UCX ERROR ibv_create_ah(dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:192.168.1.62 sgid_index=5 traffic_class=106) on mlx5_0 failed: Connection timed out
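When many ranks fail at once, it can help to list which peers could not be resolved. The following is a minimal sketch, assuming the error lines follow the format shown above and that the peer appears as an IPv4-mapped address in the dgid field. The function name failed_peers and the regular expression are illustrative and are not part of UCX or Hyper MPI.

```python
import re

# Assumption: the peer is logged as an IPv4-mapped GID, e.g. dgid=::ffff:192.168.1.115
DGID_RE = re.compile(
    r"ibv_create_ah\(.*?dgid=::ffff:(\d+\.\d+\.\d+\.\d+).*?failed: Connection timed out"
)


def failed_peers(log_text):
    """Return the de-duplicated peer IPv4 addresses from timed-out ibv_create_ah lines."""
    return sorted(set(DGID_RE.findall(log_text)))


# The two error lines quoted in this FAQ entry.
SAMPLE_LOG = (
    "[1693797506.380997] [agent289:3924149:0] ib_device.c:1252 UCX ERROR "
    "ibv_create_ah(dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:192.168.1.115 "
    "sgid_index=5 traffic_class=106) on mlx5_0 failed: Connection timed out\n"
    "[1693797506.384196] [agent276:1958414:0] ib_device.c:1252 UCX ERROR "
    "ibv_create_ah(dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:192.168.1.62 "
    "sgid_index=5 traffic_class=106) on mlx5_0 failed: Connection timed out\n"
)

if __name__ == "__main__":
    for ip in failed_peers(SAMPLE_LOG):
        print(ip)
```

The resulting addresses are the peers whose ARP entries are worth checking on the reporting nodes.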
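Possible Causes
- The error is reported by the ibv_create_ah call in the UCX module used by MPI. It means that the peer address was not obtained within the timeout period.
- The ARP anti-attack feature of the switch, by default, allows at most 30 ARP packets per second from the same source IP address. When a large-scale Hyper MPI job (100+ nodes) starts and sets up connections, a node without cached ARP entries sends more than 30 ARP packets within 1 second. The excess ARP packets are dropped, and MPI connection setup eventually times out.
- Linux keeps a local ARP cache table. Once a peer address is cached locally, it is read from the local ARP table and no further ARP request needs to be sent.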
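To check whether a node already has the peer addresses cached, the kernel ARP table can be inspected. The sketch below parses text in the /proc/net/arp format and reports peers that have no completed entry. It assumes the standard six-column layout of that file. The helper names and the sample table, including its MAC addresses and the eth0 device name, are hypothetical.

```python
ATF_COM = 0x02  # "completed entry" flag bit as shown in /proc/net/arp

# Hypothetical sample in /proc/net/arp format; MAC addresses and device are made up.
SAMPLE_ARP = """\
IP address       HW type     Flags       HW address            Mask     Device
192.168.1.62     0x1         0x2         aa:bb:cc:dd:ee:01     *        eth0
192.168.1.115    0x1         0x0         00:00:00:00:00:00     *        eth0
"""


def parse_arp_table(text):
    """Parse /proc/net/arp-style text into {ip: (mac, flags)}."""
    entries = {}
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 6:
            continue
        ip, _hw_type, flags, mac = fields[:4]
        entries[ip] = (mac, int(flags, 16))
    return entries


def unresolved_peers(peers, arp_text):
    """Return the peers that are absent from the table or whose entry is incomplete."""
    table = parse_arp_table(arp_text)
    return [ip for ip in peers if ip not in table or not (table[ip][1] & ATF_COM)]


if __name__ == "__main__":
    peers = ["192.168.1.62", "192.168.1.115", "192.168.1.200"]
    print(unresolved_peers(peers, SAMPLE_ARP))
```

On a compute node, the second argument would be the content of /proc/net/arp, for example open("/proc/net/arp").read(). Every peer returned here would trigger an ARP request at job start.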
Recovery Procedure
Configure static ARP entries on the compute nodes.
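This FAQ entry does not prescribe how to configure the entries. One common approach on Linux is the iproute2 command ip neigh replace with nud permanent. The sketch below only generates such commands from an IP-to-MAC mapping so that they can be reviewed before being run as root on each node. The mapping, the MAC addresses, and the eth0 device name are hypothetical placeholders and must be replaced with the real values of the cluster.

```python
import ipaddress
import re

MAC_RE = re.compile(r"[0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){5}")


def static_arp_commands(peers, device):
    """Build one `ip neigh replace ... nud permanent` command per peer.

    peers maps an IPv4 address string to a MAC address string.
    Raises ValueError if an address or MAC is malformed.
    """
    commands = []
    for ip, mac in sorted(peers.items()):
        ipaddress.ip_address(ip)  # raises ValueError on a malformed address
        if not MAC_RE.fullmatch(mac):
            raise ValueError(f"malformed MAC address for {ip}: {mac!r}")
        commands.append(f"ip neigh replace {ip} lladdr {mac} dev {device} nud permanent")
    return commands


if __name__ == "__main__":
    # Hypothetical mapping; real MAC addresses must come from the cluster inventory.
    peers = {
        "192.168.1.62": "aa:bb:cc:dd:ee:01",
        "192.168.1.115": "aa:bb:cc:dd:ee:02",
    }
    for cmd in static_arp_commands(peers, "eth0"):
        print(cmd)
```

Entries added with ip neigh do not survive a reboot, so the generated commands would need to be reapplied at boot time, for example from a startup script.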