Sample 4: MPI Application Analysis

Introduction

This sample uses the HPC application analysis function of the Kunpeng DevKit System Profiler to analyze an MPI application, helping you learn about the communication status of the application in each rank.

Setting Up the Environment

  1. Check that the server uses a Kunpeng 920 CPU and that the OS kernel version is 4.19 or later (or a patched 4.14 or later on openEuler).
  2. Check that the GCC version on the server is 7.3.0 or later.
  3. Check that the Kunpeng DevKit System Profiler has been installed on the server.
  4. Download the code sample ring.c from GitHub and run the following command to grant all users the read, write, and execute permissions.
    chmod 777 ring.c

Refined Analysis

  1. Prepare the code sample.

    Compile ring.c and grant all users the read, write, and execute permissions on the executable file.

    mpicc ring.c -O3 -o ring -fopenmp -lm && chmod 777 ring
  2. Create an HPC application analysis task to analyze the current application.

    Click the icon next to the System Profiler and select General analysis. On the task creation page that is displayed, select HPC Application, set the required parameters, and click OK to start the HPC application analysis task.

    Figure 1 Creating an HPC application analysis task
    Table 1 Parameter description

    Analysis Type: Set it to HPC application analysis.

    Analysis Object: Set it to Application.

    Mode: Set it to Launch application.

    Application Path: Enter the absolute path of the application. In this sample, the executable is stored in /opt/testdemo/mpi/ring on the server. In a multi-node cluster, the application must exist at this path on each node.

    Analysis Mode: Set it to Refined analysis.

    Shared Directory: If there is only one node, enter any available directory on the node. In a multi-node cluster, enter a directory shared between the nodes. In this sample, the collection is performed on two nodes, and the shared directory /home/share is used.

    mpirun Path: Enter the absolute path of the mpirun command.

    mpirun Parameter: --allow-run-as-root -H node_IP_address:number_of_ranks (for example, --allow-run-as-root -H 192.168.1.10:4)

    Sampling Duration (s): Set it to 60. If the sampling duration is too short, sampling may stop before the application finishes, and the result data may be incomplete.

    Collect More Call Stack Statistics: Enable this option.

  3. View the analysis results.
    Figure 2 Rank-to-rank heatmap

    As shown in Figure 2, open the rank-to-rank heatmap. The data volume of (rank i, rank j) is the data sent by rank i to rank j plus the data received by rank i from rank j. Drag the cursor to select an area in the left part of the heatmap to view its details in the right part. You can click the zoom icons or scroll the mouse wheel to zoom in or out on the selected area.

    Figure 3 Selecting a communicator

    In the displayed dialog box for selecting a communicator, you can search by communicator name or communicator member, sort the communicator members, and click View Details to see the communicator information.

    Figure 4 Node-to-node heatmap

    When the statistical object is Node To Node, you can view the local percentage, cross-die percentage, and cross-chip percentage of the current rank.

    Figure 5 MPI timeline

    You can select different color blocks to view a rank's communication mode, communication duration, and communication delay.

    Figure 6 MPI timeline-rank

    You can click a colored region of a rank in a given period to view the PMU event information for that period.

Statistical Analysis

  1. Prepare the code sample.

    Compile ring.c and grant all users the read, write, and execute permissions on the executable file.

    mpicc ring.c -O3 -o ring -fopenmp -lm && chmod 777 ring
  2. Create an HPC application analysis task and start the analysis.

    Click the icon next to the System Profiler and select General analysis. On the task creation page that is displayed, select HPC Application, set the required parameters, and click OK to start the HPC application analysis task.

    Figure 7 Creating an HPC application analysis task
    Table 2 Parameter description

    Analysis Type: Set it to HPC application analysis.

    Analysis Object: Set it to Application.

    Mode: Set it to Launch application.

    Application Path: Enter the absolute path of the application. In this sample, the executable is stored in /opt/testdemo/mpi/ring on the server. In a multi-node cluster, the application must exist at this path on each node.

    Analysis Mode: Set it to Statistical analysis.

    Shared Directory: If there is only one node, enter any available directory on the node. In a multi-node cluster, enter a directory shared between the nodes. In this sample, the collection is performed on two nodes, and the shared directory /home/share is used.

    mpirun Path: Enter the absolute path of the mpirun command.

    mpirun Parameter: --allow-run-as-root -H node_IP_address:number_of_ranks (for example, --allow-run-as-root -H 192.168.1.10:4)

    Sampling Mode: Set it to Detail.

    Sampling Duration (s): Set it to 60. If the sampling duration is too short, sampling may stop before the application finishes, and the result data may be incomplete.

  3. View the analysis result.
    Figure 8 Analysis summary

    As shown in Figure 8, the upper part of the Summary tab page displays tuning suggestions, the elapsed time, CPU utilization, the ratio of CPU cycles to retired instructions (CPI), the number of retired instructions, and the MPI wait rate.

    Figure 9 Hotspots

    As shown in Figure 9, the Hotspots area displays the CPU usage of hotspot functions in the application. Results are grouped by function by default; you can change the grouping to module, parallel-region, or barrier-to-barrier-segment.

    Figure 10 Memory Bandwidth and HPC Top-Down

    As shown in Figure 10, the Memory Bandwidth area displays the bandwidth used by the current application and the instruction distribution. The HPC Top-Down area displays the names and proportions of top-down events. In both areas, you can move the mouse pointer to the question mark next to a parameter to view details.

    Figure 11 MPI Runtime Metrics & Original PMU Events

    As shown in Figure 11, the MPI Runtime Metrics area displays the runtime data of the current MPI application. The Original PMU Events area displays the PMU events and their counts for the current application.