When Allgatherv Algorithm 4 Uses TCP Transmission, an Error Is Reported When np Is Set to a Large Value

Symptom

When Allgatherv algorithm 4 runs over TCP (UCX_TLS=tcp) and all cores on multiple nodes are used, processes fail with connection errors. In the following example, 1024 processes are started with 128 processes per node:

[autotest1@hmpi01 ~]$ mpirun --allow-run-as-root --timeout 350 -np 1024  -N 128  --hostfile ~/hmpifile_2021/hostfile/hf8  -x UCX_TLS=tcp  -x UCG_PLANC_UCX_ALLGATHERV_ATTR=I:4 ~/hmpifile_2021/allgatherv/allgatherv
 
Authorized users only. All activities may be monitored and reported.
[hmpi03:04566] pml_ucx.c:428  Error: ucp_ep_create(proc=514) failed: Destination is unreachable
[hmpi03:04563] pml_ucx.c:428  Error: ucp_ep_create(proc=515) failed: Destination is unreachable
[1684893155.703980] [hmpi06:4797 :1]         tcp_cm.c:749  UCX  WARN  tcp_iface 0x3c6e6b10: connection establishment for socket fd 755 from <invalid address family> to 192.168.0.1:51311 was unsuccessful
[1684893155.703997] [hmpi06:4797 :1]         tcp_cm.c:749  UCX  WARN  tcp_iface 0x3c6e6b10: connection establishment for socket fd 755 from <invalid address family> to 192.168.0.1:51311 was unsuccessful
[1684893256.847887] [hmpi05:11722:0]         tcp_cm.c:705  UCX  ERROR tcp_ep 0x86843a0: reached maximum number of connection retries (25)

Possible Causes

Setting up a large number of TCP connections at the same time generates a heavy software interrupt load. The CPUs cannot process the software interrupts in time, so TCP connection setup times out. For the Allgatherv linear algorithm, this is a limitation of TCP and is expected behavior, not a defect.
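
To give a sense of scale, the following sketch estimates how many TCP connections the failing run could require. It rests on an assumption that the source does not state: that the linear algorithm makes every rank communicate directly with every other rank, with one TCP connection per rank pair. The function names are hypothetical helpers, not part of Hyper MPI or UCX.

```shell
#!/bin/sh
# Rough connection-count estimate for the failing run.
# ASSUMPTION (not confirmed by the source): every rank connects directly
# to every other rank, and each rank pair shares one TCP connection.

# Peers each rank must reach: np - 1
peers_per_rank() { echo $(( $1 - 1 )); }

# Endpoints handled by one node: ppn * (np - 1)
endpoints_per_node() { echo $(( $2 * ($1 - 1) )); }

# Distinct rank pairs in the whole job: np * (np - 1) / 2
total_pairs() { echo $(( $1 * ($1 - 1) / 2 )); }

np=1024   # value of -np in the failing command
ppn=128   # value of -N in the failing command

echo "peers per rank:     $(peers_per_rank "$np")"
echo "endpoints per node: $(endpoints_per_node "$np" "$ppn")"
echo "total rank pairs:   $(total_pairs "$np")"
```

Under this assumption, each node has to service on the order of 130,000 endpoints during startup, which is consistent with connection setup overwhelming software interrupt processing.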

Procedure

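When using TCP transmission, select an Allgatherv algorithm other than algorithm 4, for example by changing the algorithm ID in UCG_PLANC_UCX_ALLGATHERV_ATTR (I:4 in the command above).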
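
As a minimal sketch, the helper below prints the command from the Symptom section with a different algorithm ID substituted, so that it can be reviewed before being run. The function name allgatherv_cmd is hypothetical, and the ID 2 in the usage line is only a placeholder: the source does not list the other algorithm IDs, so choose a valid one from the Hyper MPI algorithm documentation.

```shell
#!/bin/sh
# Print the mpirun command from the Symptom section with another
# Allgatherv algorithm ID. Nothing is executed; the command is only printed.
# $1: algorithm ID to place after "I:" (must be a valid ID other than 4)
allgatherv_cmd() {
    printf 'mpirun --allow-run-as-root --timeout 350 -np 1024 -N 128 --hostfile ~/hmpifile_2021/hostfile/hf8 -x UCX_TLS=tcp -x UCG_PLANC_UCX_ALLGATHERV_ATTR=I:%s ~/hmpifile_2021/allgatherv/allgatherv\n' "$1"
}

# Placeholder ID (an assumption); replace it with a documented algorithm ID.
allgatherv_cmd 2
```

Only the value after I: changes; all other options are kept exactly as in the failing command so that the two runs stay comparable.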