HPC Application Analysis
High-performance computing (HPC) leverages clusters of powerful processors to process massive, multi-dimensional datasets (big data) in parallel and solve complex problems at high speed. The HPC application analysis function provides multiple task modes to collect and analyze key metrics of HPC applications under different resource overheads, and offers tuning suggestions to help improve application performance.
HPC application analysis can be performed only on physical servers with Kunpeng 920 series processors.
Prerequisites
- Analyzing top-down and DRAM bandwidth data in HPC application tasks requires OS kernel 4.19 or later, or an openEuler 4.14 kernel (with patches) or later.
- If the mpirun implementation is MPICH, add the directory containing libmpi.so.12 (typically the lib folder under the MPICH installation path) to LD_LIBRARY_PATH.
- C, C++, and Fortran support must be enabled when building the MPI library (Open MPI or MPICH). For example, when building Open MPI with GCC, configure with --enable-mpi-compatibility CC=gcc CXX=g++ FC=gfortran to enable C, C++, and Fortran.
- Use Open MPI 4.1.6 or later (4.1.6 recommended), or MPICH 4.3.0 or later (4.3.0 recommended).
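For the MPICH prerequisite above, the library path can be exported before launching a task. A minimal sketch, assuming a hypothetical install prefix /opt/mpich (adjust to your environment):

```shell
# Hypothetical MPICH install prefix -- replace with your actual path.
MPICH_HOME=/opt/mpich

# Prepend the MPICH lib directory so the loader can find libmpi.so.12.
export LD_LIBRARY_PATH="${MPICH_HOME}/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"

# Optional sanity check; warns if the library is missing.
[ -e "${MPICH_HOME}/lib/libmpi.so.12" ] || echo "warning: libmpi.so.12 not found under ${MPICH_HOME}/lib"
```

In a multi-node run, this export must be visible on every compute node (for example, by placing it in a shell profile on the shared file system).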
Command Function
Collects system PMU events and key metrics of OpenMP and MPI applications, helping you accurately obtain the serial and parallel times of parallel regions and barrier-to-barrier intervals, calibrated L2 microarchitecture metrics, instruction distribution, L3 usage, and memory bandwidth.
- In a multi-node scenario, the tool must be installed (or extracted) in a shared directory, and you need to add node information using the -H parameter.
- The tool must be launched through the mpirun command, with the application appended at the end of the command.
Syntax
```
mpirun -n 4 devkit tuner hpc-perf -L summary <command> [<options>]
```
- The original mpirun command is mpirun -n 4 <command> [<options>].
- To run the command as a non-root user, ensure that on each compute node /proc/sys/kernel/perf_event_paranoid is set to -1 and /proc/sys/kernel/kptr_restrict is set to 0.
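The two kernel settings above can be verified before launching a collection. A minimal check script; the helper name check_perf_sysctls is illustrative and not part of the tool:

```shell
# Illustrative helper: decide whether unprivileged perf collection is possible
# given the two sysctl values named in the syntax notes.
check_perf_sysctls() {
  paranoid="$1"; kptr="$2"
  if [ "$paranoid" -le -1 ] && [ "$kptr" -eq 0 ]; then
    echo "ok"
  else
    echo "adjust-required"
  fi
}

# Read the live values; fall back to restrictive defaults if unreadable.
paranoid=$(cat /proc/sys/kernel/perf_event_paranoid 2>/dev/null || echo 2)
kptr=$(cat /proc/sys/kernel/kptr_restrict 2>/dev/null || echo 1)
echo "perf_event_paranoid=${paranoid} kptr_restrict=${kptr}: $(check_perf_sysctls "$paranoid" "$kptr")"
```

If the check reports adjust-required, set the values as root, for example with `sysctl -w kernel.perf_event_paranoid=-1` and `sysctl -w kernel.kptr_restrict=0`.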
Parameter Description
| Parameter | Option | Description |
|---|---|---|
| -h/--help | - | Obtains help information. This parameter is optional. |
| -L/--profile-level | summary/detail/graphic | Task collection type. The default value is summary. This parameter is optional. |
| -o/--output | - | Name and output path of the generated data file. If -o/--output is not specified, an hpc-perf-<timestamp>.tar file is generated in the tuner/data/ directory of the tool directory by default. For cluster collection, the default value is the path specified by rank 0 (access the path on rank 0; if that fails, access the path where the tool is located). This parameter is optional. |
| -l/--log-level | 0/1/2/3 | Log level. The default value is 1. This parameter is optional. |
| -d/--duration | - | Collection duration, in seconds. If not specified, collection continues until the application finishes. This parameter is optional. |
| -D/--delay | - | Collection delay, in seconds. The default value is 0. Applicable to the summary and detail task collection types. This parameter is optional. |
| --critical-path | - | Collects critical path data. Applicable to the summary and detail task collection types. This parameter is optional. |
| --mpi-only | - | Collects MPI data only. Applicable to the summary and detail task collection types. This parameter is optional. |
| --call-stack | - | Collects call stack data. Applicable to the graphic task collection type. This parameter is optional. |
| --rank-fuzzy | - | Fuzzification ratio. The default value is 12800. Applicable to the graphic task collection type. This parameter is optional. |
| --region-max | - | Number of communication regions displayed in the sequence diagram. The default value is 1000; a value greater than 1000 can be set. Applicable to the graphic task collection type. This parameter is optional. |
| --rdma-collect | - | Collection interval for remote direct memory access (RDMA) performance metrics, ranging from 1 to 15 seconds. If not set, RDMA performance metrics are not collected. Applicable to the graphic task collection type. This parameter is optional. |
| --shared-storage | - | Collection interval for shared storage performance metrics, ranging from 1 to 15 seconds. If not set, shared storage performance metrics are not collected. Applicable to the graphic task collection type. This parameter is optional. |
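The "applicable task collection type" constraints in the parameter table can be pre-checked before a long run. A sketch of that logic; the applies_to_level helper is hypothetical and not part of the devkit CLI:

```shell
# Hypothetical pre-flight helper mirroring the parameter table: returns 0 if
# the given flag applies to the chosen -L/--profile-level value.
applies_to_level() {
  flag="$1"; level="$2"
  case "$flag" in
    -D|--delay|--critical-path|--mpi-only)
      # These apply only to the summary and detail collection types.
      case "$level" in summary|detail) return 0 ;; esac ;;
    --call-stack|--rank-fuzzy|--region-max|--rdma-collect|--shared-storage)
      # These apply only to the graphic collection type.
      [ "$level" = "graphic" ] && return 0 ;;
    *)
      # -o, -l, -d, and the rest apply to all collection types.
      return 0 ;;
  esac
  return 1
}

applies_to_level --call-stack summary || echo "--call-stack requires -L graphic"
```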
Example
- Collect basic metrics when the task collection type is summary and the application is /opt/test/testdemo/ring:
```
/opt/ompi/bin/mpirun --allow-run-as-root -H 1.2.3.4:1 devkit tuner hpc-perf /opt/test/testdemo/ring
```
If the -L parameter is not specified, basic metrics under summary are collected.
Command output:
```
[Rank000][localhost.localdomain] =======================================PARAM CHECK=======================================
[Rank000][localhost.localdomain] PARAM CHECK success.
[Rank000][localhost.localdomain] ===================================PREPARE FOR COLLECT===================================
[Rank000][localhost.localdomain] preparation of collection success.
[Rank000][localhost.localdomain] ==================================COLLECT AND ANALYSIS ==================================
Collection duration: 1.00 s, collect until application finish
Collection duration: 2.01 s, collect until application finish
Collection duration: 3.01 s, collect until application finish
... ... ...
Time measured: 0.311326 seconds.
Collection duration: 4.01 s, collect until application finish
Collection duration: 5.02 s, collect until application finish
Collection duration: 6.02 s, collect until application finish
done
Resolving symbols...done
Symbols reduction...done
Calculating MPI imbalance...0.0011 sec
Aggregating MPI/OpenMP data...done
Processing hardware events data started
Reading perf trace...Reading perf trace...Reading perf trace...
Collection duration: 7.03 s, collect until application finish
Reading perf trace...0.173 sec
Sorting samples...0.173 sec
Sorting samples...0.000 sec
Loading MPI critical path segments...0.000 sec
Sorting MPI critical path segments...0.000 sec
Aggregating samples...0.000 sec
... ... ...
Raw collection data is stored in /tmp/.devkit_3b9014edeb20b0ed674a9121f1996fb0/TMP_HPCTOOL_DATA/my_raw_data.v1_19.mpirun-3204841473
Issue#1: High CPI value (0.67), ideal value is 0.25. It indicates non-efficient CPU MicroArchitecture usage.
Possible solutions:
  1. Top-down MicroArchitecture tree shows high value of BackEnd Bound/Core Bound metric (0.62).
Issue#2: CPU under-utilization - inappropriate amount of ranks.
Possible solutions:
  1. Consider increasing total amount of MPI ranks from 10 to 128 using mpirun -n option, or try to parallelize code with both MPI and OpenMP so the number of processes(ranks * OMP_NUM_THREADS) will be equal to CPUs(128) count.
Issue#3: High Inter Socket Bandwidth value (5.82 GB/s). Average DRAM Bandwidth is 46.25 GB/s.
Possible solutions:
  1. Consider allocating memory on the same NUMA node it is used.
HINT: Consider re-running collection with -l detail option to get more information about microarchitecture related issues.
The report /home/hpc-perf/hpc-perf-20240314-110009.tar is generated successfully
To view the summary report, you can run: devkit report -i /home/hpc-perf/hpc-perf-20240314-110009.tar
To view the detail report, you can import the report to WebUI or IDE
[Rank000][localhost.localdomain] =====================================RESTORE CLUSTER=====================================
[Rank000][localhost.localdomain] restore cluster success.
[Rank000][localhost.localdomain] =========================================FINISH =========================================
```
- The task collection type is detail (common) and the application is /opt/test/testdemo/ring:
```
/opt/ompi/bin/mpirun --allow-run-as-root -H 1.2.3.4:1 devkit tuner hpc-perf -L detail -o /home/hpc-perf-detail.tar -l 0 -d 20 -D 15 /opt/test/testdemo/ring
```
-o sets the path of the generated report package. -l sets the log level, with 0 indicating DEBUG. -d 20 sets the collection duration to 20 seconds. -D 15 delays collection by 15 seconds (collection starts after the application has run for 15 seconds). These parameters support all task collection types.
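The interaction of -D and -d can be read as a time window over the application run. A small illustration of the arithmetic, using the values from the example above:

```shell
# With -D 15 -d 20, collection covers seconds 15..35 of the application run.
delay=15       # -D: seconds to wait after the application starts
duration=20    # -d: seconds of data to collect once started
start=$delay
end=$((delay + duration))
echo "collecting from t=${start}s to t=${end}s"   # prints: collecting from t=15s to t=35s
```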
- The task collection type is detail (multiple parameters) and the application is /opt/test/testdemo/ring:
```
/opt/ompi/bin/mpirun --allow-run-as-root -H 1.2.3.4:1 devkit tuner hpc-perf -L detail -o /home/hpc-perf-detail.tar -l 0 -d 20 -D 15 --mpi-only --critical-path /opt/test/testdemo/ring
```
--mpi-only reduces the OpenMP collection and analysis workload, and --critical-path collects critical path information. These parameters support the summary and detail task collection types.
- Collect communication heatmap information when the task collection type is graphic and the application is /opt/test/testdemo/ring:
```
/opt/ompi/bin/mpirun --allow-run-as-root -H 1.2.3.4:1 devkit tuner hpc-perf -L graphic -d 300 --call-stack --rank-fuzzy 128 --region-max 1000 --rdma-collect 1 --shared-storage 1 /opt/test/testdemo/ring
```
--call-stack collects call stack data. --rank-fuzzy adjusts the fuzzification value of the heatmap. --region-max adjusts the number of regions in the sequence diagram. --rdma-collect 1 indicates that RDMA data is collected every second. --shared-storage 1 indicates that shared storage data is collected every second. The preceding parameters support the graphic task collection type.
Command output:
```
[Rank000][localhost.localdomain] =======================================PARAM CHECK=======================================
[Rank000][localhost.localdomain] PARAM CHECK success.
[Rank000][localhost.localdomain] ===================================PREPARE FOR COLLECT===================================
[Rank000][localhost.localdomain] preparation of collection success.
[Rank000][localhost.localdomain] ==================================COLLECT AND ANALYSIS ==================================
loop time :2
send message from rank 4 to rank 9
send message from rank 4 to rank 9
loop time :2
... ... ...
send message from rank 5 to rank 0
Time measured: 0.310607 seconds.
barrier rank 8
send message from rank 8 to rank 3
Time measured: 0.310637 seconds.
Finish collection. Progress 100%.
Postprocessing OTF2 trace(s)...
Successful OTF2 traces are stored in /tmp/.devkit_1d8ffdfa224a7372beceb19b52a8c510/TMP_HPCTOOL_DATA/my_result.v1_19.rank*.otf2
Raw collection data is stored in /tmp/.devkit_1d8ffdfa224a7372beceb19b52a8c510/TMP_HPCTOOL_DATA/my_raw_data.v1_19.mpirun-2708930561
4 region has been processed, cost time 0.19 s.
The report /home/hpc-perf/hpc-perf-20240314-110450.tar is generated successfully
To view the detail report, you can import the report to the WebUI or IDE
[Rank000][localhost.localdomain] =====================================RESTORE CLUSTER=====================================
[Rank000][localhost.localdomain] restore cluster success.
[Rank000][localhost.localdomain] =========================================FINISH =========================================
```
- Use the report function to view the task report.
```
devkit report -i /home/hpc-perf/hpc-perf-xxxxxxx-xxxxxxx.tar
```
Command output:
```
Elapsed Time                 : 0.3161 s (rank 000, jitter = 0.0002s)
CPU Utilization              : 7.64 % (9.78 out of 128 CPUs)
  Effective Utilization      : 7.64 % (9.78 out of 128 CPUs)
  Spinning                   : 0.00 % (0.00 out of 128 CPUs)
  Overhead                   : 0.00 % (0.00 out of 128 CPUs)
Cycles per Instruction (CPI) : 0.6718
Instructions Retired         : 11961967858

MPI
===
MPI Wait Rate      : 0.20 %
  Imbalance Rate   : 0.14 %
  Transfer Rate    : 0.06 %

Top Waiting MPI calls
Function     Caller location                          Wait(%)  Imb(%)  Transfer(%)
---------------------------------------------------------------------------------
MPI_Barrier  main@test_time_loop.c:72                 0.17     0.13    0.03
MPI_Recv     first_communicate@test_time_loop.c:17    0.02     0.01    0.01
MPI_Send     first_communicate@test_time_loop.c:14    0.01     0       0.01
MPI_Send     second_communicate@test_time_loop.c:31   0.00     0       0.00
MPI_Recv     second_communicate@test_time_loop.c:28   0.00     0.00    0.00

Top Hotspots on MPI Critical Path
Function      Module     CPU Time(s)  Inst Retired  CPI
----------------------------------------------------------
do_n_multpli  test_loop  0.3101       1189348874    0.6780

Top MPI Critical Path Segments
MPI Critical Path segment                                         Elapsed Time(s)  CPU Time(s)  Inst Retired  CPI
-------------------------------------------------------------------------------------------------------------------
MPI_Send@test_time_loop.c:14 to MPI_Barrier@test_time_loop.c:72   0.3158           0.3101       1189348874    0.6780
MPI_Send@test_time_loop.c:14 to MPI_Send@test_time_loop.c:14      0.0000

Instruction Mix
===============
Memory                          : 16.70 %
  Load                          : 11.71 %
  Store                         : 4.87 %
Scalar                          : 39.26 %
  Integer                       : 23.89 %
  Floating Point                : 15.37 %
Vector                          : 0.08 %
  Advanced SIMD                 : 0.08 %
  Crypto                        : 0.00 %
Branches                        : 14.08 %
  Immediate                     : 10.91 %
  Indirect                      : 2.07 %
  Return                        : 1.10 %
Barriers                        : 0.10 %
  Instruction Synchronization   : 0.03 %
  Data Synchronization          : 0.00 %
  Data Memory                   : 0.06 %
Not Retired                     : 5.10 %

Top-down
========
Retiring             : 37.21 %
Backend Bound        : 62.85 %
  Memory Bound       : 0.56 %
    L1 Bound         : 0.55 %
    L2 Bound         : value is out of range likely because of not enough samples collected
    L3 or DRAM Bound : 0.01 %
    Store Bound      : 0.00 %
  Core Bound         : 62.30 %
Frontend Bound       : 0.02 %
Bad Speculation      : 0.00 %

Memory subsystem
================
Average DRAM Bandwidth      : 46.2486 GB/s
  Read                      : 30.5352 GB/s
  Write                     : 15.7134 GB/s
L3 By-Pass ratio            : 20.23 %
L3 miss ratio               : 59.96 %
L3 Utilization Efficiency   : 54.90 %
Within Socket Bandwidth     : 3.8750 GB/s
Inter Socket Bandwidth      : 5.8249 GB/s

I/O
===
Calls   : 940
Read    : 360 bytes
Written : 490 bytes
Time    : 0.0446 s

Top IO calls by time
Function  Time(s)  Calls  Read(bytes)  Written(bytes)
mkstemps  0.0139   50     0            0
mkstemp   0.0118   40     0            0
write     0.0057   70     0            490
creat     0.0043   20     0            0
open      0.0037   230    0            0
openat    0.0031   30     0            0
close     0.0009   370    0            0
read      0.0008   60     360          0
fopen     0.0003   60     0            0
fileno    0.0000   10     0            0

Function  Wait(%)  Avg(ms)    Call Count  Data Size(bytes)
MPI_Recv  0.00     0.1256     9           3686400
MPI_Send  71.39    5000.5209  9           3686400

Function  Caller location                      File name                                                                                 Time(s)  Calls  Read(bytes)  Written(bytes)
close     _mmap_segment_attach+0x6c@unknown:0  /tmp/ompi.localhost.0/pid.2963699/pmix_dstor_ds12_2963699/dstore_sm.lock                 0.0000   10     0            0
close     _mmap_segment_attach+0x6c@unknown:0  /tmp/ompi.localhost.0/pid.2963699/pmix_dstor_ds12_2963699/initial-pmix_shared-segment-0  0.0000   10     0            0
close     _mmap_segment_attach+0x6c@unknown:0  /tmp/ompi.localhost.0/pid.2963699/pmix_dstor_ds21_2963699/initial-pmix_shared-segment-0  0.0000   10     0            0
close     _mmap_segment_attach+0x6c@unknown:0  /tmp/ompi.localhost.0/pid.2963699/pmix_dstor_ds21_2963699/smdataseg-3897622529-0         0.0000   10     0            0
close     _mmap_segment_attach+0x6c@unknown:0  /tmp/ompi.localhost.0/pid.2963699/pmix_dstor_ds21_2963699/smlockseg-3897622529           0.0000   10     0            0
close     _mmap_segment_attach+0x6c@unknown:0  /tmp/ompi.localhost.0/pid.2963699/pmix_dstor_ds21_2963699/smseg-3897622529-0             0.0000   10     0            0
close     closeWrapper+0x24@unknown:0          /home/cfn                                                                                0.0001   30     0            0
close     closeWrapper+0x24@unknown:0          example.txt                                                                              0.0003   140    0            0
close     closeWrapper+0x24@unknown:0          example1.txt                                                                             0.0001   20     0            0
close     closeWrapper+0x24@unknown:0          example2.txt                                                                             0.0001   30     0            0
close     closeWrapper+0x24@unknown:0          tmp_0A4lYpid                                                                             0.0000   1      0            0
close     closeWrapper+0x24@unknown:0          tmp_23QXrqid                                                                             0.0000   1      0            0
close     closeWrapper+0x24@unknown:0          tmp_2JqORvid                                                                             0.0000   1      0            0
close     closeWrapper+0x24@unknown:0          tmp_2NaWGvid                                                                             0.0000   1      0            0
close     closeWrapper+0x24@unknown:0          tmp_2PqjXsid                                                                             0.0000   1      0            0
close     closeWrapper+0x24@unknown:0          tmp_4I76qf                                                                               0.0000   1      0            0
```
- Output description
- For the detail and summary task collection types, you can run the report command to view the results.
- For the graphic task collection type, you can import the TAR package to the WebUI to view the graphical information. For details about how to import a TAR package, see Task Management.
- For details about the HPC metrics when using the report function, see the WebUI parameter description in Viewing Analysis Results.