
Sample 5: Long Application Execution Caused by MPI Blocking Communication Functions

Symptom

By default, an MPI application uses the MPI_Send and MPI_Recv functions to send and receive data. Both are blocking communication calls, which can hold up the execution of other code until the transfer completes. As a result, the execution time of the application is prolonged.

Tuning Strategy

In the HPC application analysis result of the System Profiler, check what percentage of the application's time is spent blocked in these APIs. Replace them with the asynchronous APIs MPI_Isend and MPI_Irecv to reduce the blocking time and improve the application execution efficiency.
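
The strategy can be sketched as follows. This is a minimal illustration, not the send_recv.cpp case source; the buffer names, peer computation, and message size are assumptions made for the example. It starts both transfers without blocking, leaves room for useful computation while the messages are in flight, and waits only when the data is actually needed.

```c
/* Minimal sketch of overlapping communication with computation using
 * MPI_Isend/MPI_Irecv. Names and sizes are illustrative assumptions. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int send_buf = rank, recv_buf = -1;
    /* Exchange with the neighboring rank; with one rank, exchange with self. */
    int to   = (size > 1) ? (rank + 1) % size : rank;
    int from = (size > 1) ? (rank - 1 + size) % size : rank;

    MPI_Request reqs[2];
    /* Start both transfers; neither call blocks. */
    MPI_Isend(&send_buf, 1, MPI_INT, to, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&recv_buf, 1, MPI_INT, from, 0, MPI_COMM_WORLD, &reqs[1]);

    /* Useful computation can run here while the messages are in flight. */

    /* Block only when the received data is needed. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    printf("rank %d received %d\n", rank, recv_buf);

    MPI_Finalize();
    return 0;
}
```

Compile with mpicc and run with, for example, mpirun -np 2; with a single rank the process simply exchanges with itself.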

Procedure

  1. Download the code sample send_recv.cpp from GitHub and compile it.
    mpicc send_recv.cpp -O3 -o send_recv -fopenmp -lm
    
  2. Create an HPC application analysis task to analyze the current application.

    Click next to the System Profiler and select General analysis. On the task creation page that is displayed, select HPC Application, set the required parameters, and click OK to start the HPC application analysis task.

    Figure 1 Creating an HPC application analysis task
    Table 1 Task parameters

    Analysis Type: Set it to HPC application analysis.

    Analysis Object: Set it to Application.

    Mode: Set it to Launch application.

    Application Path: Enter the absolute path of the application. In this sample, the sample code is stored in /opt/testdemo/send_recv on the server. In a multi-node cluster, the application must exist in the same directory on each node.

    Analysis Mode: Set it to Statistical analysis.

    Shared Directory: If there is only one node, enter an available directory on the operating system. In a multi-node cluster, enter the directory shared between the nodes. In this sample, the collection is performed on two nodes, and the shared directory /home/share is used.

    mpirun Path: Enter the absolute path of the mpirun command.

    mpirun Parameter: --allow-run-as-root -H node_IP_address:number_of_ranks (for example, --allow-run-as-root -H 192.168.1.10:4)

    Sampling Mode: Set it to Summary.

    Other Parameters: Retain their default values.

  3. View the task result.

    The MPI Wait Rate is high, and point-to-point communication is its only contributor. Further analysis of the MPI runtime metrics shows that MPI_Send blocking time accounts for a high proportion, reducing application execution efficiency.

    Figure 2 Analysis result
    Figure 3 MPI runtime metrics
  4. Analyze the source code.
    MPI_Send and MPI_Recv are used to send and receive data, which may cause blocking and affect the execution of other code.
    ierr = MPI_Send(&number, send_len, MPI_INT, from_rank, tag, MPI_COMM_WORLD);
    ierr = MPI_Recv(&number, recv_len, MPI_INT, from_rank, tag, MPI_COMM_WORLD, &status);
    
  5. Add the MPI_Request variable declaration.

    Add the declaration of the MPI_Request variable at the beginning of the source code. This variable is required by the MPI_Isend and MPI_Irecv calls. The declaration has already been added to the case source code and has no effect if MPI_Isend and MPI_Irecv are not called.

    MPI_Request request;
    
  6. Change the APIs to asynchronous APIs.
    Change MPI_Send and MPI_Recv to asynchronous APIs MPI_Isend and MPI_Irecv to reduce the blocking time during data transmission and reception.
    ierr = MPI_Isend(&number, send_len, MPI_INT, from_rank, tag, MPI_COMM_WORLD, &request);
    ierr = MPI_Irecv(&number, recv_len, MPI_INT, from_rank, tag, MPI_COMM_WORLD, &request);
    
  7. Add asynchronous waiting to adapt to asynchronous communication.

    For non-blocking communication functions such as MPI_Isend and MPI_Irecv, use a wait mechanism to ensure that the communication has completed. Add the MPI_Wait call before the code that uses the transferred data.

    MPI_Wait(&request, &status);
    
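    One detail to watch when adapting this pattern: if a single MPI_Request variable is passed to both MPI_Isend and MPI_Irecv, the second call overwrites the handle from the first, so a subsequent MPI_Wait completes only the receive. A common alternative is to keep separate request handles and complete both with MPI_Waitall. The helper below is a hedged sketch of that pattern; the function name and buffers are illustrative and not part of the case source.

```c
#include <mpi.h>

/* Illustrative helper (not in send_recv.cpp): exchange one int with a peer,
 * tracking the send and the receive with separate request handles so that
 * MPI_Waitall completes both transfers. */
void exchange_int(int *send_buf, int *recv_buf, int peer, int tag, MPI_Comm comm)
{
    MPI_Request requests[2];

    MPI_Isend(send_buf, 1, MPI_INT, peer, tag, comm, &requests[0]);
    MPI_Irecv(recv_buf, 1, MPI_INT, peer, tag, comm, &requests[1]);

    /* Overlapping computation can run here while both transfers progress. */

    MPI_Waitall(2, requests, MPI_STATUSES_IGNORE);
}
```

    Using distinct send and receive buffers also avoids having the same memory in flight for both operations at once.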
  8. Recompile the source code and analyze the application again.

    Run the command provided in step 1 to recompile the application, then use the tool to analyze the application again and observe the execution duration and the MPI runtime metrics.

Tuning Result

The application execution time is shortened from 60.01s to 30s.

Figure 4 Analysis result after tuning
Figure 5 MPI runtime metrics after tuning