Overview
The MPI_Allreduce function performs a reduction operation (for example, summation: sum = a1 + a2 + a3 + …) across data contributed by all processes. In high-concurrency scenarios, the reduction order greatly affects floating-point results. In essence, this precision problem is the long-standing reproducibility problem in the parallel computing field. The root cause is that floating-point arithmetic on computers involves rounding (truncation) errors, so it does not obey the associative law. See Figure 1. The calculation result therefore depends on the operation order: when multiple processes or threads participate in the calculation, different operation orders lead to different results.
In MPI, the collective operations whose floating-point results depend on the operation order are MPI_Allreduce, MPI_Reduce, MPI_Reduce_scatter, MPI_Scan, and MPI_Exscan. MPI collective communication can be implemented with many different algorithms, and different algorithms reduce in different orders, thereby producing different calculation results. The following uses the Allreduce algorithms as an example (Recursive Doubling vs. Rabenseifner):


- In MPI collective communication operations, implementations can differ even when the algorithm names are the same. Taking Allreduce as an example, the Rabenseifner algorithm consists of two phases, reduce-scatter and allgather, and each phase can use a different combination of sub-algorithms. Therefore, even if the names are the same, the algorithm implementations can be different.



- Even for the same MPI collective communication algorithm, the calculation result can be different each time.
- For a tree algorithm, the reduce step is a many-to-one process, so the order in which child data arrives at the parent is uncontrollable.
- If the parent waits until all child data has arrived and then performs the reduction, the performance is poor, but the results are reproducible.
- If the parent reduces child data in arrival order, the performance is good, but the reduction order varies between runs, so the results are inconsistent.

Figure 3 K-nomial tree logic

- The calculation result varies across platforms.
- For non-topology-aware algorithms, the reduction order is fixed, so the result does not depend on the platform topology.
- For topology-aware algorithms, processes on a node are first aggregated at the node leader before the inter-node reduction. If the number of cores per node differs between platforms, the grouping of the additions differs, and so do the results.
A high-concurrency scenario is one in which a large number of MPI processes request reduction at the same time, or within a very short period of time.
