High-performance computing (HPC) uses clusters of powerful processors working in parallel to process massive multidimensional data sets (also known as big data) and solve complex problems at extremely high speed. This command provides multiple task modes for scenarios with different resource overheads, collects and analyzes the key metrics of HPC applications, and gives optimization suggestions to help users improve program performance.
It collects system PMU events together with key metrics of OpenMP and MPI applications, accurately obtaining the serial and parallel time of Parallel regions and Barrier-to-Barrier sections, calibrated two-level microarchitecture metrics, the instruction mix, L3 cache utilization, memory bandwidth, and related information.
mpirun -n 4 devkit tuner hpc-perf -L summary <command> [<options>]
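For example, a hybrid MPI+OpenMP application can be profiled by wrapping its launch command with the tool. The sketch below is illustrative only: the rank count, the OMP_NUM_THREADS value, the output name, and the application path are placeholders, and Open MPI's mpirun is assumed.

export OMP_NUM_THREADS=4                       # threads per MPI rank (assumed value)
mpirun -n 4 -x OMP_NUM_THREADS \
    devkit tuner hpc-perf -L summary -o my_app_summary \
    /path/to/my_mpi_openmp_app                 # hypothetical application path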
| Parameter | Option | Description |
|---|---|---|
| -h/--help | - | Displays help information. |
| -o/--output | - | Sets the name of the generated data file. By default the file is generated in the current directory; for cluster collection, it defaults to the directory specified on rank0 (the directory must be accessible on rank0; otherwise the file is placed in the tool's directory). |
| -l/--log-level | 0/1/2/3 | Sets the log level. The default is 1. |
| -d/--duration | - | Sets the collection duration in seconds. By default, collection continues until the application finishes. |
| -L/--profile-level | summary/detail/graphic | Sets the collection type of the task. The default is summary. |
| -D/--delay | - | Sets the delay before collection starts, in seconds. The default is 0. Applies to the summary and detail collection types. |
| --topn | - | Enables collection of the Top N inefficient communications. Applies to the graphic collection type. |
| --critical-path | - | Enables collection of the Critical Path. Applies to the summary and detail collection types. |
| --mpi-only | - | Collects MPI metrics only. Applies to the summary and detail collection types. |
| --call-stack | - | Enables collection of call stack information. Applies to the graphic collection type. |
| --rank-fuzzy | - | Specifies the fuzzification factor. The default is 12800. Applies to the graphic collection type. |
| --region-max | >1000 | Specifies the number of communication regions shown in the timing diagram. The default is 1000; if set explicitly, the value must be greater than 1000. Applies to the graphic collection type. |
| --rdma-collect | 1-15 | Specifies the interval, in seconds (1 to 15), at which RDMA performance metrics are collected. If this option is not specified, RDMA performance metrics are not collected. Applies to the graphic collection type. |
| --shared-storage | 1-15 | Specifies the interval, in seconds (1 to 15), at which shared-storage performance metrics are collected. If this option is not specified, shared-storage performance metrics are not collected. Applies to the graphic collection type. |
/opt/ompi/bin/mpirun --allow-run-as-root -H 1.2.3.4:1 devkit tuner hpc-perf /opt/test/testdemo/ring
If the -L option is not specified, summary-level data is collected by default.
The returned information is as follows:
[Rank000][localhost.localdomain] =======================================PARAM CHECK=======================================
[Rank000][localhost.localdomain] PARAM CHECK success.
[Rank000][localhost.localdomain] ===================================PREPARE FOR COLLECT===================================
[Rank000][localhost.localdomain] preparation of collection success.
[Rank000][localhost.localdomain] ==================================COLLECT AND ANALYSIS ==================================
Collection duration: 1.00 s, collect until application finish
Collection duration: 2.01 s, collect until application finish
Collection duration: 3.01 s, collect until application finish
...
...
...
Time measured: 0.311326 seconds.
Collection duration: 4.01 s, collect until application finish
Collection duration: 5.02 s, collect until application finish
Collection duration: 6.02 s, collect until application finish
done
Resolving symbols...done
Symbols reduction...done
Calculating MPI imbalance...0.0011 sec
Aggregating MPI/OpenMP data...done
Processing hardware events data started
Reading perf trace...Reading perf trace...Reading perf trace...Reading perf trace...Reading perf trace...Reading perf trace...Reading perf trace...Reading perf trace...Reading perf trace...Reading perf trace...
Collection duration: 7.03 s, collect until application finish
0.173 sec Sorting samples...0.173 sec Sorting samples...0.000 sec
Loading MPI critical path segments...0.000 sec
Sorting MPI critical path segments...0.000 sec
Aggregating samples...0.000 sec
...
...
...
Raw collection data is stored in /tmp/.devkit_3b9014edeb20b0ed674a9121f1996fb0/TMP_HPCTOOL_DATA/my_raw_data.v1_19.mpirun-3204841473
Issue#1: High CPI value (0.67), ideal value is 0.25. It indicates non-efficient CPU MicroArchitecture usage.
Possible solutions:
    1. Top-down MicroArchitecture tree shows high value of BackEnd Bound/Core Bound metric (0.62).
Issue#2: CPU under-utilization - inappropriate amount of ranks.
Possible solutions:
    1. Consider increasing total amount of MPI ranks from 10 to 128 using mpirun -n option, or try to parallelize code with both MPI and OpenMP so the number of processes(ranks * OMP_NUM_THREADS) will be equal to CPUs(128) count.
Issue#3: High Inter Socket Bandwidth value (5.82 GB/s). Average DRAM Bandwidth is 46.25 GB/s.
Possible solutions:
    1. Consider allocating memory on the same NUMA node it is used.
HINT: Consider re-running collection with -l detail option to get more information about microarchitecture related issues.
The report /home/hpc-perf/hpc-perf-20240314-110009.tar is generated successfully
To view the summary report,you can run: devkit report -i /home/hpc-perf/hpc-perf-20240314-110009.tar
To view the detail report,you can import the report to WebUI or IDE
[Rank000][localhost.localdomain] =====================================RESTORE CLUSTER=====================================
[Rank000][localhost.localdomain] restore cluster success.
[Rank000][localhost.localdomain] =========================================FINISH =========================================
/opt/ompi/bin/mpirun --allow-run-as-root -H 1.2.3.4:1 devkit tuner hpc-perf -L detail -o /home/hpc-perf-detail.tar -l 0 -d 20 -D 15 /opt/test/testdemo/ring
The -o option specifies the path of the generated report archive; -l specifies the log level (0 is DEBUG); -d 20 collects data for 20 seconds; -D 15 delays collection by 15 seconds (that is, collection starts 15 seconds after the application starts running). These options are supported by all collection levels.
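As a sketch of how -D and -d interact (host and application path reused from the example above; log level left at its default): the application starts at t = 0, collection begins at about t = 15 s because of -D 15, and stops roughly 20 seconds later because of -d 20, presumably earlier if the application exits first.

# Collection window: approximately t = 15 s to t = 35 s of the application run.
/opt/ompi/bin/mpirun --allow-run-as-root -H 1.2.3.4:1 \
    devkit tuner hpc-perf -L detail -D 15 -d 20 -o /home/hpc-perf-detail.tar \
    /opt/test/testdemo/ring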
/opt/ompi/bin/mpirun --allow-run-as-root -H 1.2.3.4:1 devkit tuner hpc-perf -L detail -o /home/hpc-perf-detail.tar -l 0 -d 20 -D 15 --mpi-only --critical-path /opt/test/testdemo/ring
--mpi-only reduces OpenMP collection and analysis to improve performance; --critical-path collects critical path information. These options are supported by the summary and detail collection levels.
/opt/ompi/bin/mpirun --allow-run-as-root -H 1.2.3.4:1 devkit tuner hpc-perf -L graphic -d 300 --topn --call-stack --rank-fuzzy 128 --region-max 1000 --rdma-collect 1 --shared-storage 1 /opt/test/testdemo/ring
--call-stack collects call stack data; --rank-fuzzy adjusts the fuzzification factor of the heat map; --region-max adjusts the number of regions in the timing diagram; --rdma-collect 1 collects RDMA data once per second; --shared-storage 1 collects shared-storage data once per second. These options are supported by the graphic collection level.
The returned information is as follows:
[Rank000][localhost.localdomain] =======================================PARAM CHECK=======================================
[Rank000][localhost.localdomain] PARAM CHECK success.
[Rank000][localhost.localdomain] ===================================PREPARE FOR COLLECT===================================
[Rank000][localhost.localdomain] preparation of collection success.
[Rank000][localhost.localdomain] ==================================COLLECT AND ANALYSIS ==================================
loop time :2
send message from rank 4 to rank 9
send message from rank 4 to rank 9
loop time :2
...
...
...
send message from rank 5 to rank 0
Time measured: 0.310607 seconds.
barrier rank 8
send message from rank 8 to rank 3
Time measured: 0.310637 seconds.
Finish collection.Progress 100%.
Postprocessing OTF2 trace(s)...
Successful OTF2 traces are stored in /tmp/.devkit_1d8ffdfa224a7372beceb19b52a8c510/TMP_HPCTOOL_DATA/my_result.v1_19.rank*.otf2
Raw collection data is stored in /tmp/.devkit_1d8ffdfa224a7372beceb19b52a8c510/TMP_HPCTOOL_DATA/my_raw_data.v1_19.mpirun-2708930561
4 region has been processed, cost time 0.19 s.
The report /home/hpc-perf/hpc-perf-20240314-110450.tar is generated successfully
To view the detail report, you can import the report to the WebUI or IDE
[Rank000][localhost.localdomain] =====================================RESTORE CLUSTER=====================================
[Rank000][localhost.localdomain] restore cluster success.
[Rank000][localhost.localdomain] =========================================FINISH =========================================
devkit report -i /home/hpc-perf/hpc-perf-xxxxxxx-xxxxxxx.tar
The returned information is as follows:
Elapsed Time                 : 0.3161 s (rank 000, jitter = 0.0002s)
CPU Utilization              : 7.64 % (9.78 out of 128 CPUs)
    Effective Utilization    : 7.64 % (9.78 out of 128 CPUs)
    Spinning                 : 0.00 % (0.00 out of 128 CPUs)
    Overhead                 : 0.00 % (0.00 out of 128 CPUs)
Cycles per Instruction (CPI) : 0.6718
Instructions Retired         : 11961967858

MPI
===
MPI Wait Rate    : 0.20 %
Imbalance Rate   : 0.14 %
Transfer Rate    : 0.06 %

Top Waiting MPI calls
Function       Caller location                          Wait(%)   Imb(%)   Transfer(%)
---------------------------------------------------------------------------------
MPI_Barrier    main@test_time_loop.c:72                 0.17      0.13     0.03
MPI_Recv       first_communicate@test_time_loop.c:17    0.02      0.01     0.01
MPI_Send       first_communicate@test_time_loop.c:14    0.01      0        0.01
MPI_Send       second_communicate@test_time_loop.c:31   0.00      0        0.00
MPI_Recv       second_communicate@test_time_loop.c:28   0.00      0.00     0.00

Top Hotspots on MPI Critical Path
Function        Module       CPU Time(s)   Inst Retired   CPI
----------------------------------------------------------
do_n_multpli    test_loop    0.3101        1189348874     0.6780

Top MPI Critical Path Segments
MPI Critical Path segment                                          Elapsed Time(s)   CPU Time(s)   Inst Retired   CPI
-------------------------------------------------------------------------------------------------------------------
MPI_Send@test_time_loop.c:14 to MPI_Barrier@test_time_loop.c:72    0.3158            0.3101        1189348874     0.6780
MPI_Send@test_time_loop.c:14 to MPI_Send@test_time_loop.c:14       0.0000

Instruction Mix
===============
Memory           : 58.20 %
Integer          : 25.11 %
Floating Point   : 0.00 %
Advanced SIMD    : 0.00 %
Not Retired      : 0.00 %

Top-down
========
Retiring                 : 37.21 %
Backend Bound            : 62.85 %
    Memory Bound         : 0.56 %
        L1 Bound         : 0.55 %
        L2 Bound         : value is out of range likely because of not enough samples collected
        L3 or DRAM Bound : 0.01 %
        Store Bound      : 0.00 %
    Core Bound           : 62.30 %
Frontend Bound           : 0.02 %
Bad Speculation          : 0.00 %

Memory subsystem
================
Average DRAM Bandwidth      : 46.2486 GB/s
    Read                    : 30.5352 GB/s
    Write                   : 15.7134 GB/s
L3 By-Pass ratio            : 20.23 %
L3 miss ratio               : 59.96 %
L3 Utilization Efficiency   : 54.90 %
Within Socket Bandwidth     : 3.8750 GB/s
Inter Socket Bandwidth      : 5.8249 GB/s

I/O
===
Calls    : 940
Read     : 360 bytes
Written  : 490 bytes
Time     : 0.0446 s

Top IO calls by time
Function    Time(s)   Calls   Read(bytes)   Written(bytes)
mkstemps    0.0139    50      0             0
mkstemp     0.0118    40      0             0
write       0.0057    70      0             490
creat       0.0043    20      0             0
open        0.0037    230     0             0
openat      0.0031    30      0             0
close       0.0009    370     0             0
read        0.0008    60      360           0
fopen       0.0003    60      0             0
fileno      0.0000    10      0             0

Function    Wait(%)   Avg(ms)     Call Count   Data Size(bytes)
MPI_Recv    0.00      0.1256      9            3686400
MPI_Send    71.39     5000.5209   9            3686400

Function   Caller location                       File name                                                                                  Time(s)   Calls   Read(bytes)   Written(bytes)
close      _mmap_segment_attach+0x6c@unknown:0   /tmp/ompi.localhost.0/pid.2963699/pmix_dstor_ds12_2963699/dstore_sm.lock                   0.0000    10      0             0
close      _mmap_segment_attach+0x6c@unknown:0   /tmp/ompi.localhost.0/pid.2963699/pmix_dstor_ds12_2963699/initial-pmix_shared-segment-0    0.0000    10      0             0
close      _mmap_segment_attach+0x6c@unknown:0   /tmp/ompi.localhost.0/pid.2963699/pmix_dstor_ds21_2963699/initial-pmix_shared-segment-0    0.0000    10      0             0
close      _mmap_segment_attach+0x6c@unknown:0   /tmp/ompi.localhost.0/pid.2963699/pmix_dstor_ds21_2963699/smdataseg-3897622529-0           0.0000    10      0             0
close      _mmap_segment_attach+0x6c@unknown:0   /tmp/ompi.localhost.0/pid.2963699/pmix_dstor_ds21_2963699/smlockseg-3897622529             0.0000    10      0             0
close      _mmap_segment_attach+0x6c@unknown:0   /tmp/ompi.localhost.0/pid.2963699/pmix_dstor_ds21_2963699/smseg-3897622529-0               0.0000    10      0             0
close      closeWrapper+0x24@unknown:0           /home/cfn                                                                                  0.0001    30      0             0
close      closeWrapper+0x24@unknown:0           example.txt                                                                                0.0003    140     0             0
close      closeWrapper+0x24@unknown:0           example1.txt                                                                               0.0001    20      0             0
close      closeWrapper+0x24@unknown:0           example2.txt                                                                               0.0001    30      0             0
close      closeWrapper+0x24@unknown:0           tmp_0A4lYpid                                                                               0.0000    1       0             0
close      closeWrapper+0x24@unknown:0           tmp_23QXrqid                                                                               0.0000    1       0             0
close      closeWrapper+0x24@unknown:0           tmp_2JqORvid                                                                               0.0000    1       0             0
close      closeWrapper+0x24@unknown:0           tmp_2NaWGvid                                                                               0.0000    1       0             0
close      closeWrapper+0x24@unknown:0           tmp_2PqjXsid                                                                               0.0000    1       0             0
close      closeWrapper+0x24@unknown:0           tmp_4I76qf                                                                                 0.0000    1       0             0
1. For detail and summary tasks, the results can be viewed with the report command.
2. For graphic tasks, the graphical results can be viewed by importing the report into the Web UI.
3. For details of the HPC metrics shown by the report command, see the parameter description of the Web UI.
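A minimal end-to-end sketch of the summary workflow, assuming the archive name passed to -o is used verbatim (host, paths, and file names are placeholders):

# Collect summary-level data and write the report archive to a known path.
/opt/ompi/bin/mpirun --allow-run-as-root -H 1.2.3.4:1 \
    devkit tuner hpc-perf -L summary -o /home/hpc-perf/ring-summary.tar \
    /opt/test/testdemo/ring

# View the text report for the summary/detail task on the command line.
devkit report -i /home/hpc-perf/ring-summary.tar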