
HPC Application Analysis

Command Function

Collects system-wide PMU events and the key metrics of OpenMP and MPI applications. The results help you accurately obtain the serial and parallel time of parallel regions and barrier-to-barrier intervals, calibrated two-layer microarchitecture metrics, instruction distribution, L3 cache usage, and memory bandwidth.

  • In the multi-node scenario, the tool must be installed (or decompressed) in a shared directory, and you need to specify the nodes by adding node information with the -H parameter.
  • You can import tasks for which TAR packages have been generated to the WebUI for visualized viewing. For details, see the description of importing tasks in Task Management.
  • You must run the command through mpirun and append the application after the command.

Syntax

mpirun -n 4 devkit tuner hpc-perf -L summary <command> [<options>]
  • The original mpirun command is mpirun -n 4 <command> [<options>].
  • An RPM package is used as an example. If you use a compressed package, specify the absolute path of the tool, for example, /home/DevKit-CLI-x.x.x-Linux-Kunpeng/devkit tuner.
  • If you run the command as a common user, ensure that /proc/sys/kernel/perf_event_paranoid is set to -1 and /proc/sys/kernel/kptr_restrict is set to 0 on each compute node.
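The kernel settings mentioned above can be verified, and adjusted if necessary, with standard procfs and sysctl commands. This is a generic sketch (not part of the tool itself); run the sysctl commands as root, on each compute node:

```shell
# Verify the kernel settings required for collection as a common user.
# Expected values: -1 for perf_event_paranoid, 0 for kptr_restrict.
cat /proc/sys/kernel/perf_event_paranoid
cat /proc/sys/kernel/kptr_restrict

# Adjust them if needed (requires root). These changes do not persist
# across reboots; add the keys to /etc/sysctl.conf to make them permanent.
sysctl -w kernel.perf_event_paranoid=-1
sysctl -w kernel.kptr_restrict=0
```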

Parameter Description

Table 1 Parameter description

-h/--help
    Obtains help information.

-o/--output PATH
    Sets the path where the data package is generated. The default is the current path, as resolved on rank 0. (The path must be accessible on rank 0; if it is not, the path where the tool is located is used instead.)

-l/--log-level {0,1,2,3}
    Sets the log level. The default value is 1 (info).
      • 0: debug
      • 1: info
      • 2: warning
      • 3: error

-d/--duration NUM
    Sets the collection duration, in seconds. If this parameter is not set, collection continues until the application finishes.

-L/--profile-level {summary, detail, graphic}
    Specifies the task collection type. The default value is summary.
      • summary: collects basic metrics. This mode has low sampling and analysis overheads.
      • detail: collects hotspot functions and detailed metrics. This mode has high sampling and analysis overheads.
      • graphic: collects communication heatmap information.

-D/--delay NUM
    Specifies the sampling delay, in seconds. The default value is 0.

-i/--interval NUM
    Specifies the sampling interval, in milliseconds. The default value is 5000.

--system-collect
    Collects system performance metrics. Valid when the task collection type is summary or detail.

--topn
    Collects the top N inefficient communications. Valid when the task collection type is graphic.

--critical-path
    Collects critical paths. Valid when the task collection type is summary or detail.

--mpi-only
    Collects only MPI metrics. Valid when the task collection type is summary or detail.

--call-stack
    Collects call stack information. Valid when the task collection type is graphic.

--rank-fuzzy NUM
    Specifies the fuzzification value. The default value is 12800. Valid when the task collection type is graphic.

--region-max NUM
    Specifies the number of communication regions displayed in the sequence diagram. The default value is 1000; a configured value must be greater than 1000. Valid when the task collection type is graphic.

--rdma-collect NUM
    Specifies the sampling interval, in seconds, for collecting remote direct memory access (RDMA) performance metrics. The value ranges from 1 to 15. If this parameter is not specified, RDMA performance metrics are not collected. Valid when the task collection type is graphic.

--shared-storage NUM
    Specifies the sampling interval, in seconds, for collecting shared folder performance metrics. The value ranges from 1 to 15. If this parameter is not specified, shared folder performance metrics are not collected. Valid when the task collection type is graphic.

Example

  • Collect basic metrics when the task type is summary and the application is /opt/test/testdemo/ring:
    /opt/ompi/bin/mpirun --allow-run-as-root -H 1.2.3.4:1 devkit tuner hpc-perf /opt/test/testdemo/ring

    If the -L parameter is not specified, basic metrics under summary are collected.

    Command output:

    [Rank000][localhost.localdomain] =======================================PARAM CHECK=======================================
    [Rank000][localhost.localdomain] PARAM CHECK success.
    [Rank000][localhost.localdomain] ===================================PREPARE FOR COLLECT===================================
    [Rank000][localhost.localdomain] preparation of collection success.
    [Rank000][localhost.localdomain] ==================================COLLECT AND ANALYSIS ==================================
    Collection duration: 1.00 s, collect until application finish
    Collection duration: 2.01 s, collect until application finish
    Collection duration: 3.01 s, collect until application finish
    ...
    ...
    ...
    Time measured: 0.311326 seconds.
    Collection duration: 4.01 s, collect until application finish
    Collection duration: 5.02 s, collect until application finish
    Collection duration: 6.02 s, collect until application finish
    done
    Resolving symbols...done
    Symbols reduction...done
    Calculating MPI imbalance...0.0011 sec
    Aggregating MPI/OpenMP data...done
    Processing hardware events data started
    Reading perf trace...Reading perf trace...Reading perf trace...Reading perf trace...Reading perf trace...Reading perf trace...Reading perf trace...Reading perf trace...Reading perf trace...Reading perf trace...Collection duration: 7.03 s, collect until application finish
    0.173 sec
    Sorting samples...0.173 sec
    Sorting samples...0.000 sec
    Loading MPI critical path segments...0.000 sec
    Sorting MPI critical path segments...0.000 sec
    Aggregating samples...0.000 sec
    ...
    ...
    ...
    Raw collection data is stored in /tmp/.devkit_3b9014edeb20b0ed674a9121f1996fb0/TMP_HPCTOOL_DATA/my_raw_data.v1_19.mpirun-3204841473
    Issue#1: High CPI value (0.67), ideal value is 0.25. It indicates non-efficient
      CPU MicroArchitecture usage. Possible solutions:
      1. Top-down  MicroArchitecture  tree  shows  high value of BackEnd Bound/Core
         Bound metric (0.62).
    Issue#2: CPU  under-utilization  -  inappropriate  amount  of  ranks.  Possible
      solutions:
      1. Consider  increasing total amount of MPI ranks from 10 to 128 using mpirun
         -n  option,  or  try  to  parallelize code with both MPI and OpenMP so the
         number  of  processes(ranks  * OMP_NUM_THREADS) will be equal to CPUs(128)
         count.
    Issue#3: High  Inter Socket Bandwidth value (5.82 GB/s). Average DRAM Bandwidth
      is 46.25 GB/s. Possible solutions:
      1. Consider   allocating   memory   on   the  same  NUMA  node  it  is  used.
    HINT:  Consider  re-running  collection  with  -l  detail  option  to  get more
      information about microarchitecture related issues.
    The report /home/hpc-perf/hpc-perf-20240314-110009.tar is generated successfully
    To view the summary report, you can run: devkit report -i /home/hpc-perf/hpc-perf-20240314-110009.tar
    To view the detail report, you can import the report to WebUI or IDE
    [Rank000][localhost.localdomain] =====================================RESTORE CLUSTER=====================================
    [Rank000][localhost.localdomain] restore cluster success.
    [Rank000][localhost.localdomain] =========================================FINISH =========================================
  • The task type is detail (basic parameters) and the application is /opt/test/testdemo/ring:
    /opt/ompi/bin/mpirun --allow-run-as-root -H 1.2.3.4:1  devkit tuner hpc-perf -L detail -o /home/hpc-perf-detail.tar -l 0 -d 20 -D 15 /opt/test/testdemo/ring

    -o sets the path of the generated report package. -l sets the log level; the value 0 indicates DEBUG. -d 20 sets the collection duration to 20 seconds. -D 15 delays the start of collection by 15 seconds (collection starts after the application has run for 15 seconds). These parameters support all task collection types.

  • The task type is detail (multiple parameters) and the application is /opt/test/testdemo/ring:
    /opt/ompi/bin/mpirun --allow-run-as-root -H 1.2.3.4:1 devkit tuner hpc-perf -L detail -o /home/hpc-perf-detail.tar -l 0 -d 20 -D 15 --system-collect --mpi-only --critical-path /opt/test/testdemo/ring

    --system-collect collects system performance metrics. --mpi-only skips OpenMP collection and analysis, reducing overhead. --critical-path collects critical path information. These parameters support the summary and detail task collection types.

  • Collect communication heatmap information when the task type is graphic and the application is /opt/test/testdemo/ring:
    /opt/ompi/bin/mpirun --allow-run-as-root -H 1.2.3.4:1  devkit tuner hpc-perf -L graphic -d 300 --topn --call-stack --rank-fuzzy 128 --region-max 1000 --rdma-collect 1 --shared-storage 1 /opt/test/testdemo/ring

    --call-stack collects call stack data. --rank-fuzzy adjusts the fuzzification value of the heatmap. --region-max adjusts the number of regions displayed in the sequence diagram. --rdma-collect 1 collects RDMA data every second. --shared-storage 1 collects shared storage data every second. These parameters support only the graphic task collection type.

    Command output:

    [Rank000][localhost.localdomain] =======================================PARAM CHECK=======================================
    [Rank000][localhost.localdomain] PARAM CHECK success.
    [Rank000][localhost.localdomain] ===================================PREPARE FOR COLLECT===================================
    [Rank000][localhost.localdomain] preparation of collection success.
    [Rank000][localhost.localdomain] ==================================COLLECT AND ANALYSIS ==================================
    loop time :2
    send message from rank 4 to rank 9
    send message from rank 4 to rank 9
    loop time :2
    ...
    ...
    ...
    send message from rank 5 to rank 0
    Time measured: 0.310607 seconds.
    barrier rank 8
    send message from rank 8 to rank 3
    Time measured: 0.310637 seconds.
    Finish collection.Progress 100%.
    Postprocessing OTF2 trace(s)...
    Successful
    OTF2 traces are stored in /tmp/.devkit_1d8ffdfa224a7372beceb19b52a8c510/TMP_HPCTOOL_DATA/my_result.v1_19.rank*.otf2
    Raw collection data is stored in /tmp/.devkit_1d8ffdfa224a7372beceb19b52a8c510/TMP_HPCTOOL_DATA/my_raw_data.v1_19.mpirun-2708930561
    4 region has been processed, cost time 0.19 s.
    The report /home/hpc-perf/hpc-perf-20240314-110450.tar is generated successfully
    To view the detail report, you can import the report to the WebUI or IDE
    [Rank000][localhost.localdomain] =====================================RESTORE CLUSTER=====================================
    [Rank000][localhost.localdomain] restore cluster success.
    [Rank000][localhost.localdomain] =========================================FINISH =========================================
  • Use the report function to view the task report.
    devkit report -i /home/hpc-perf/hpc-perf-xxxxxxx-xxxxxxx.tar

    Command output:

    Elapsed Time                          :               0.3161 s (rank 000, jitter = 0.0002s)
    CPU Utilization                       :                 7.64 % (9.78 out of 128 CPUs)
      Effective Utilization               :                 7.64 % (9.78 out of 128 CPUs)
      Spinning                            :                 0.00 % (0.00 out of 128 CPUs)
      Overhead                            :                 0.00 % (0.00 out of 128 CPUs)
      Cycles per Instruction (CPI)        :               0.6718
      Instructions Retired                :          11961967858
    MPI
    ===
      MPI Wait Rate                       :                 0.20 %
        Imbalance Rate                    :                 0.14 %
        Transfer Rate                     :                 0.06 %
        Top Waiting MPI calls
        Function     Caller location                         Wait(%)  Imb(%)  Transfer(%)
        ---------------------------------------------------------------------------------
        MPI_Barrier  main@test_time_loop.c:72                   0.17    0.13         0.03
        MPI_Recv     first_communicate@test_time_loop.c:17      0.02    0.01         0.01
        MPI_Send     first_communicate@test_time_loop.c:14      0.01       0         0.01
        MPI_Send     second_communicate@test_time_loop.c:31     0.00       0         0.00
        MPI_Recv     second_communicate@test_time_loop.c:28     0.00    0.00         0.00
      Top Hotspots on MPI Critical Path
      Function      Module     CPU Time(s)  Inst Retired     CPI
      ----------------------------------------------------------
      do_n_multpli  test_loop       0.3101    1189348874  0.6780
      Top MPI Critical Path Segments
      MPI Critical Path segment                                        Elapsed Time(s)  CPU Time(s)  Inst Retired     CPI
      -------------------------------------------------------------------------------------------------------------------
      MPI_Send@test_time_loop.c:14 to MPI_Barrier@test_time_loop.c:72           0.3158       0.3101    1189348874  0.6780
      MPI_Send@test_time_loop.c:14 to MPI_Send@test_time_loop.c:14              0.0000
    Instruction Mix
    ===============
      Memory                              :                58.20 %
      Integer                             :                25.11 %
      Floating Point                      :                 0.00 %
      Advanced SIMD                       :                 0.00 %
      Not Retired                         :                 0.00 %
    Top-down
    ========
      Retiring                            :                37.21 %
      Backend Bound                       :                62.85 %
        Memory Bound                      :                 0.56 %
          L1 Bound                        :                 0.55 %
          L2 Bound                        : value is out of range likely because of not enough samples collected
          L3 or DRAM Bound                :                 0.01 %
          Store Bound                     :                 0.00 %
        Core Bound                        :                62.30 %
      Frontend Bound                      :                 0.02 %
      Bad Speculation                     :                 0.00 %
    Memory subsystem
    ================
      Average DRAM Bandwidth              :              46.2486 GB/s
        Read                              :              30.5352 GB/s
        Write                             :              15.7134 GB/s
      L3 By-Pass ratio                    :                20.23 %
      L3 miss ratio                       :                59.96 %
      L3 Utilization Efficiency           :                54.90 %
      Within Socket Bandwidth             :               3.8750 GB/s
      Inter Socket Bandwidth              :               5.8249 GB/s
  • Output description

    1. For the detail and summary task collection types, you can use the report function to view the results.

    2. For the graphic task collection type, import the generated package to the WebUI for visualized viewing.

    3. For details about the HPC metrics reported by the report function, see the WebUI parameter description in Viewing Analysis Results.