High-performance computing (HPC) uses clusters of powerful processors working in parallel to process massive multidimensional data sets (also known as big data) and solve complex problems at extremely high speed. This command provides multiple task modes for scenarios with different resource overheads, collects and analyzes the key metrics of HPC applications, and gives optimization suggestions to help users improve program performance.
It collects system PMU events together with key metrics of OpenMP and MPI applications, accurately obtaining the serial and parallel time of Parallel regions and Barrier-to-Barrier sections, calibrated two-level microarchitecture metrics, the instruction mix, L3 cache utilization, memory bandwidth, and related information.
mpirun -n 4 devkit tuner hpc-perf -L summary <command> [<options>]
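For example, a hybrid MPI+OpenMP application can be profiled by wrapping its launch command with the tool. The sketch below is illustrative only: the rank count, the OMP_NUM_THREADS value, the output name, and the application path are placeholders, and Open MPI's mpirun is assumed.

export OMP_NUM_THREADS=4                       # threads per MPI rank (assumed value)
mpirun -n 4 -x OMP_NUM_THREADS \
    devkit tuner hpc-perf -L summary -o my_app_summary \
    /path/to/my_mpi_openmp_app                 # hypothetical application path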
| Parameter | Option | Description |
|---|---|---|
| -h/--help | - | Displays help information. |
| -o/--output | - | Sets the name of the generated data file. By default the file is generated in the current directory; for cluster collection, it defaults to the directory specified on rank0 (the directory must be accessible on rank0; otherwise the file is placed in the tool's directory). |
| -l/--log-level | 0/1/2/3 | Sets the log level. The default is 1. |
| -d/--duration | - | Sets the collection duration in seconds. By default, collection continues until the application finishes. |
| -L/--profile-level | summary/detail/graphic | Sets the collection type of the task. The default is summary. |
| -D/--delay | - | Sets the delay before collection starts, in seconds. The default is 0. Applies to the summary and detail collection types. |
| --topn | - | Enables collection of the Top N inefficient communications. Applies to the graphic collection type. |
| --critical-path | - | Enables collection of the Critical Path. Applies to the summary and detail collection types. |
| --mpi-only | - | Collects MPI metrics only. Applies to the summary and detail collection types. |
| --call-stack | - | Enables collection of call stack information. Applies to the graphic collection type. |
| --rank-fuzzy | - | Specifies the fuzzification factor. The default is 12800. Applies to the graphic collection type. |
| --region-max | >1000 | Specifies the number of communication regions shown in the timing diagram. The default is 1000; if set explicitly, the value must be greater than 1000. Applies to the graphic collection type. |
| --rdma-collect | 1-15 | Specifies the interval, in seconds (1 to 15), at which RDMA performance metrics are collected. If this option is not specified, RDMA performance metrics are not collected. Applies to the graphic collection type. |
| --shared-storage | 1-15 | Specifies the interval, in seconds (1 to 15), at which shared-storage performance metrics are collected. If this option is not specified, shared-storage performance metrics are not collected. Applies to the graphic collection type. |
/opt/ompi/bin/mpirun --allow-run-as-root -H 1.2.3.4:1 devkit tuner hpc-perf /opt/test/testdemo/ring
If the -L option is not specified, summary-level data is collected by default.
The returned information is as follows:
[Rank000][localhost.localdomain] =======================================PARAM CHECK=======================================
[Rank000][localhost.localdomain] PARAM CHECK success.
[Rank000][localhost.localdomain] ===================================PREPARE FOR COLLECT===================================
[Rank000][localhost.localdomain] preparation of collection success.
[Rank000][localhost.localdomain] ==================================COLLECT AND ANALYSIS ==================================
Collection duration: 1.00 s, collect until application finish
Collection duration: 2.01 s, collect until application finish
Collection duration: 3.01 s, collect until application finish
...
...
...
Time measured: 0.311326 seconds.
Collection duration: 4.01 s, collect until application finish
Collection duration: 5.02 s, collect until application finish
Collection duration: 6.02 s, collect until application finish
done
Resolving symbols...done
Symbols reduction...done
Calculating MPI imbalance...0.0011 sec
Aggregating MPI/OpenMP data...done
Processing hardware events data started
Reading perf trace...Reading perf trace...Reading perf trace...Reading perf trace...Reading perf trace...Reading perf trace...Reading perf trace...Reading perf trace...Reading perf trace...Reading perf trace...
Collection duration: 7.03 s, collect until application finish
0.173 sec Sorting samples...0.173 sec Sorting samples...0.000 sec
Loading MPI critical path segments...0.000 sec
Sorting MPI critical path segments...0.000 sec
Aggregating samples...0.000 sec
...
...
...
Raw collection data is stored in /tmp/.devkit_3b9014edeb20b0ed674a9121f1996fb0/TMP_HPCTOOL_DATA/my_raw_data.v1_19.mpirun-3204841473
Issue#1: High CPI value (0.67), ideal value is 0.25. It indicates non-efficient CPU MicroArchitecture usage.
Possible solutions:
    1. Top-down MicroArchitecture tree shows high value of BackEnd Bound/Core Bound metric (0.62).
Issue#2: CPU under-utilization - inappropriate amount of ranks.
Possible solutions:
    1. Consider increasing total amount of MPI ranks from 10 to 128 using mpirun -n option, or try to parallelize code with both MPI and OpenMP so the number of processes(ranks * OMP_NUM_THREADS) will be equal to CPUs(128) count.
Issue#3: High Inter Socket Bandwidth value (5.82 GB/s). Average DRAM Bandwidth is 46.25 GB/s.
Possible solutions:
    1. Consider allocating memory on the same NUMA node it is used.
HINT: Consider re-running collection with -l detail option to get more information about microarchitecture related issues.
The report /home/hpc-perf/hpc-perf-20240314-110009.tar is generated successfully
To view the summary report,you can run: devkit report -i /home/hpc-perf/hpc-perf-20240314-110009.tar
To view the detail report,you can import the report to WebUI or IDE
[Rank000][localhost.localdomain] =====================================RESTORE CLUSTER=====================================
[Rank000][localhost.localdomain] restore cluster success.
[Rank000][localhost.localdomain] =========================================FINISH =========================================
/opt/ompi/bin/mpirun --allow-run-as-root -H 1.2.3.4:1 devkit tuner hpc-perf -L detail -o /home/hpc-perf-detail.tar -l 0 -d 20 -D 15 /opt/test/testdemo/ring
The -o option specifies the path of the generated report archive; -l specifies the log level (0 is DEBUG); -d 20 collects data for 20 seconds; -D 15 delays collection by 15 seconds (that is, collection starts 15 seconds after the application starts running). These options are supported by all collection levels.
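As a sketch of how -D and -d interact (host and application path reused from the example above; log level left at its default): the application starts at t = 0, collection begins at about t = 15 s because of -D 15, and stops roughly 20 seconds later because of -d 20, presumably earlier if the application exits first.

# Collection window: approximately t = 15 s to t = 35 s of the application run.
/opt/ompi/bin/mpirun --allow-run-as-root -H 1.2.3.4:1 \
    devkit tuner hpc-perf -L detail -D 15 -d 20 -o /home/hpc-perf-detail.tar \
    /opt/test/testdemo/ring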
/opt/ompi/bin/mpirun --allow-run-as-root -H 1.2.3.4:1 devkit tuner hpc-perf -L detail -o /home/hpc-perf-detail.tar -l 0 -d 20 -D 15 --mpi-only --critical-path /opt/test/testdemo/ring
--mpi-only reduces OpenMP collection and analysis to improve performance; --critical-path collects critical path information. These options are supported by the summary and detail collection levels.
/opt/ompi/bin/mpirun --allow-run-as-root -H 1.2.3.4:1 devkit tuner hpc-perf -L graphic -d 300 --topn --call-stack --rank-fuzzy 128 --region-max 1000 --rdma-collect 1 --shared-storage 1 /opt/test/testdemo/ring
--call-stack collects call stack data; --rank-fuzzy adjusts the fuzzification factor of the heat map; --region-max adjusts the number of regions in the timing diagram; --rdma-collect 1 collects RDMA data once per second; --shared-storage 1 collects shared-storage data once per second. These options are supported by the graphic collection level.
The returned information is as follows:
[Rank000][localhost.localdomain] =======================================PARAM CHECK=======================================
[Rank000][localhost.localdomain] PARAM CHECK success.
[Rank000][localhost.localdomain] ===================================PREPARE FOR COLLECT===================================
[Rank000][localhost.localdomain] preparation of collection success.
[Rank000][localhost.localdomain] ==================================COLLECT AND ANALYSIS ==================================
loop time :2
send message from rank 4 to rank 9
send message from rank 4 to rank 9
loop time :2
...
...
...
send message from rank 5 to rank 0
Time measured: 0.310607 seconds.
barrier rank 8
send message from rank 8 to rank 3
Time measured: 0.310637 seconds.
Finish collection.Progress 100%.
Postprocessing OTF2 trace(s)...
Successful OTF2 traces are stored in /tmp/.devkit_1d8ffdfa224a7372beceb19b52a8c510/TMP_HPCTOOL_DATA/my_result.v1_19.rank*.otf2
Raw collection data is stored in /tmp/.devkit_1d8ffdfa224a7372beceb19b52a8c510/TMP_HPCTOOL_DATA/my_raw_data.v1_19.mpirun-2708930561
4 region has been processed, cost time 0.19 s.
The report /home/hpc-perf/hpc-perf-20240314-110450.tar is generated successfully
To view the detail report, you can import the report to the WebUI or IDE
[Rank000][localhost.localdomain] =====================================RESTORE CLUSTER=====================================
[Rank000][localhost.localdomain] restore cluster success.
[Rank000][localhost.localdomain] =========================================FINISH =========================================
devkit report -i /home/hpc-perf/hpc-perf-xxxxxxx-xxxxxxx.tar
The returned information is as follows:
Elapsed Time                 : 0.3161 s (rank 000, jitter = 0.0002s)
CPU Utilization              : 7.64 % (9.78 out of 128 CPUs)
    Effective Utilization    : 7.64 % (9.78 out of 128 CPUs)
    Spinning                 : 0.00 % (0.00 out of 128 CPUs)
    Overhead                 : 0.00 % (0.00 out of 128 CPUs)
Cycles per Instruction (CPI) : 0.6718
Instructions Retired         : 11961967858

MPI
===
MPI Wait Rate    : 0.20 %
Imbalance Rate   : 0.14 %
Transfer Rate    : 0.06 %

Top Waiting MPI calls
Function       Caller location                          Wait(%)   Imb(%)   Transfer(%)
---------------------------------------------------------------------------------
MPI_Barrier    main@test_time_loop.c:72                 0.17      0.13     0.03
MPI_Recv       first_communicate@test_time_loop.c:17    0.02      0.01     0.01
MPI_Send       first_communicate@test_time_loop.c:14    0.01      0        0.01
MPI_Send       second_communicate@test_time_loop.c:31   0.00      0        0.00
MPI_Recv       second_communicate@test_time_loop.c:28   0.00      0.00     0.00

Top Hotspots on MPI Critical Path
Function        Module       CPU Time(s)   Inst Retired   CPI
----------------------------------------------------------
do_n_multpli    test_loop    0.3101        1189348874     0.6780

Top MPI Critical Path Segments
MPI Critical Path segment                                          Elapsed Time(s)   CPU Time(s)   Inst Retired   CPI
-------------------------------------------------------------------------------------------------------------------
MPI_Send@test_time_loop.c:14 to MPI_Barrier@test_time_loop.c:72    0.3158            0.3101        1189348874     0.6780
MPI_Send@test_time_loop.c:14 to MPI_Send@test_time_loop.c:14       0.0000

Instruction Mix
===============
Memory           : 58.20 %
Integer          : 25.11 %
Floating Point   : 0.00 %
Advanced SIMD    : 0.00 %
Not Retired      : 0.00 %

Top-down
========
Retiring                 : 37.21 %
Backend Bound            : 62.85 %
    Memory Bound         : 0.56 %
        L1 Bound         : 0.55 %
        L2 Bound         : value is out of range likely because of not enough samples collected
        L3 or DRAM Bound : 0.01 %
        Store Bound      : 0.00 %
    Core Bound           : 62.30 %
Frontend Bound           : 0.02 %
Bad Speculation          : 0.00 %

Memory subsystem
================
Average DRAM Bandwidth      : 46.2486 GB/s
    Read                    : 30.5352 GB/s
    Write                   : 15.7134 GB/s
L3 By-Pass ratio            : 20.23 %
L3 miss ratio               : 59.96 %
L3 Utilization Efficiency   : 54.90 %
Within Socket Bandwidth     : 3.8750 GB/s
Inter Socket Bandwidth      : 5.8249 GB/s

I/O
===
Calls    : 940
Read     : 360 bytes
Written  : 490 bytes
Time     : 0.0446 s

Top IO calls by time
Function    Time(s)   Calls   Read(bytes)   Written(bytes)
mkstemps    0.0139    50      0             0
mkstemp     0.0118    40      0             0
write       0.0057    70      0             490
creat       0.0043    20      0             0
open        0.0037    230     0             0
openat      0.0031    30      0             0
close       0.0009    370     0             0
read        0.0008    60      360           0
fopen       0.0003    60      0             0
fileno      0.0000    10      0             0

Function    Wait(%)   Avg(ms)     Call Count   Data Size(bytes)
MPI_Recv    0.00      0.1256      9            3686400
MPI_Send    71.39     5000.5209   9            3686400

Function   Caller location                       File name                                                                                  Time(s)   Calls   Read(bytes)   Written(bytes)
close      _mmap_segment_attach+0x6c@unknown:0   /tmp/ompi.localhost.0/pid.2963699/pmix_dstor_ds12_2963699/dstore_sm.lock                   0.0000    10      0             0
close      _mmap_segment_attach+0x6c@unknown:0   /tmp/ompi.localhost.0/pid.2963699/pmix_dstor_ds12_2963699/initial-pmix_shared-segment-0    0.0000    10      0             0
close      _mmap_segment_attach+0x6c@unknown:0   /tmp/ompi.localhost.0/pid.2963699/pmix_dstor_ds21_2963699/initial-pmix_shared-segment-0    0.0000    10      0             0
close      _mmap_segment_attach+0x6c@unknown:0   /tmp/ompi.localhost.0/pid.2963699/pmix_dstor_ds21_2963699/smdataseg-3897622529-0           0.0000    10      0             0
close      _mmap_segment_attach+0x6c@unknown:0   /tmp/ompi.localhost.0/pid.2963699/pmix_dstor_ds21_2963699/smlockseg-3897622529             0.0000    10      0             0
close      _mmap_segment_attach+0x6c@unknown:0   /tmp/ompi.localhost.0/pid.2963699/pmix_dstor_ds21_2963699/smseg-3897622529-0               0.0000    10      0             0
close      closeWrapper+0x24@unknown:0           /home/cfn                                                                                  0.0001    30      0             0
close      closeWrapper+0x24@unknown:0           example.txt                                                                                0.0003    140     0             0
close      closeWrapper+0x24@unknown:0           example1.txt                                                                               0.0001    20      0             0
close      closeWrapper+0x24@unknown:0           example2.txt                                                                               0.0001    30      0             0
close      closeWrapper+0x24@unknown:0           tmp_0A4lYpid                                                                               0.0000    1       0             0
close      closeWrapper+0x24@unknown:0           tmp_23QXrqid                                                                               0.0000    1       0             0
close      closeWrapper+0x24@unknown:0           tmp_2JqORvid                                                                               0.0000    1       0             0
close      closeWrapper+0x24@unknown:0           tmp_2NaWGvid                                                                               0.0000    1       0             0
close      closeWrapper+0x24@unknown:0           tmp_2PqjXsid                                                                               0.0000    1       0             0
close      closeWrapper+0x24@unknown:0           tmp_4I76qf                                                                                 0.0000    1       0             0
1. For detail and summary tasks, the results can be viewed with the report command.
2. For graphic tasks, the graphical results can be viewed by importing the report into the Web UI.
3. For details of the HPC metrics shown by the report command, see the parameter description of the Web UI.
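A minimal end-to-end sketch of the summary workflow, assuming the archive name passed to -o is used verbatim (host, paths, and file names are placeholders):

# Collect summary-level data and write the report archive to a known path.
/opt/ompi/bin/mpirun --allow-run-as-root -H 1.2.3.4:1 \
    devkit tuner hpc-perf -L summary -o /home/hpc-perf/ring-summary.tar \
    /opt/test/testdemo/ring

# View the text report for the summary/detail task on the command line.
devkit report -i /home/hpc-perf/ring-summary.tar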