When Allgatherv Algorithm 4 Uses TCP Transmission, an Error Is Reported When np Is Set to a Large Value
Symptom
When Allgatherv algorithm 4 runs over TCP and all cores on multiple nodes are used, processes fail with connection errors. In the following example, 1024 processes are started with 128 processes per node.
[autotest1@hmpi01 ~]$ mpirun --allow-run-as-root --timeout 350 -np 1024 -N 128 --hostfile ~/hmpifile_2021/hostfile/hf8 -x UCX_TLS=tcp -x UCG_PLANC_UCX_ALLGATHERV_ATTR=I:4 ~/hmpifile_2021/allgatherv/allgatherv
Authorized users only. All activities may be monitored and reported.
[hmpi03:04566] pml_ucx.c:428 Error: ucp_ep_create(proc=514) failed: Destination is unreachable
[hmpi03:04563] pml_ucx.c:428 Error: ucp_ep_create(proc=515) failed: Destination is unreachable
[1684893155.703980] [hmpi06:4797 :1] tcp_cm.c:749 UCX WARN tcp_iface 0x3c6e6b10: connection establishment for socket fd 755 from <invalid address family> to 192.168.0.1:51311 was unsuccessful
[1684893155.703997] [hmpi06:4797 :1] tcp_cm.c:749 UCX WARN tcp_iface 0x3c6e6b10: connection establishment for socket fd 755 from <invalid address family> to 192.168.0.1:51311 was unsuccessful
[1684893256.847887] [hmpi05:11722:0] tcp_cm.c:705 UCX ERROR tcp_ep 0x86843a0: reached maximum number of connection retries (25)
Possible Causes
Setting up a large number of TCP connections generates a heavy software interrupt load. The CPUs cannot process the software interrupts in time, so TCP connection setup times out. For the Allgatherv linear algorithm (algorithm 4 in this example), this is a limitation of TCP and is expected behavior.
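To give a sense of scale, the following minimal sketch estimates how many connections the failing run may need. It assumes that in the linear algorithm every rank exchanges data directly with every other rank, and that each pair of ranks needs one TCP connection. Both are simplifying assumptions; the actual number of sockets that UCX opens may differ. The function names are hypothetical and exist only for this illustration.

```python
def linear_peer_count(num_procs: int) -> int:
    """Peers each rank contacts directly (assumption: all-to-all exchange)."""
    return num_procs - 1


def linear_total_connections(num_procs: int) -> int:
    """Distinct rank pairs (assumption: one TCP connection per pair)."""
    return num_procs * (num_procs - 1) // 2


NP = 1024  # value of -np in the failing command

print(linear_peer_count(NP))         # 1023
print(linear_total_connections(NP))  # 523776
```

Under these assumptions, each rank tries to reach 1023 peers at roughly the same time, which is consistent with the connection retries and timeouts shown in the log.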
Procedure
When using TCP transmission, select an Allgatherv algorithm other than algorithm 4.
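As an illustration, the sketch below rebuilds the command from the Symptom section with a different algorithm ID. The helper build_mpirun_command is hypothetical, and the ID 2 is only a placeholder: check which Allgatherv algorithm IDs your installed version supports before using it. Only options that appear in the original command are used.

```python
def build_mpirun_command(algorithm_id, num_procs, procs_per_node, hostfile, binary):
    """Return the mpirun argument list for a TCP run with the given algorithm ID."""
    if algorithm_id == 4:
        # Algorithm 4 is the one that fails over TCP at large process counts.
        raise ValueError("choose an Allgatherv algorithm other than 4 for TCP")
    return [
        "mpirun",
        "-np", str(num_procs),
        "-N", str(procs_per_node),
        "--hostfile", hostfile,
        "-x", "UCX_TLS=tcp",
        "-x", f"UCG_PLANC_UCX_ALLGATHERV_ATTR=I:{algorithm_id}",
        binary,
    ]


# Placeholder ID 2: an assumption for illustration, not a recommendation.
cmd = build_mpirun_command(
    2, 1024, 128,
    "~/hmpifile_2021/hostfile/hf8",
    "~/hmpifile_2021/allgatherv/allgatherv",
)
print(" ".join(cmd))
```

The printed line can be pasted into a shell. Refusing ID 4 inside the helper keeps the known-bad combination out of scripted test runs.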