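Based on ARM PMU (Performance Monitor Unit) events, top-down analysis shows how instructions run on the CPU pipeline, so that you can modify your program in a targeted way and make full use of the available hardware resources.
It also helps you quickly locate the CPU performance bottlenecks of the current application.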
    devkit tuner top-down [-h] [-c {n | n,m | n-m}] [-d <sec>] [-D <sec>] [-l {0, 1, 2, 3}] [-L {0, 1, 2, 3, 4, 5, 6}] [-i <sec>] [-p {PID | PID1,PID2 | ALL}] [-r {user, kernel, all}] [-o] [--package] [workload workload...]
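devkit tuner top-down [workload workload...] collects data for a specified application. Replace [workload workload...] with the application path and the application arguments. When both -c/--cpu and -p/--pid are specified, the target specified by -p takes precedence.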
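Parameter | Option | Description |
---|---|---|
-h/--help | - | Displays help information. |
-c/--cpu | - | Specifies the CPU cores to collect, for example, "0", "0,1,2", or "0-2". |
-d/--duration | - | Sets the collection duration in seconds. The minimum value is 1 second. By default, collection continues until it is stopped. Press Ctrl+\ to cancel the task, or press Ctrl+C to stop collection and start analysis. |
-D/--delay | - | Sets the collection delay. The default is 0 seconds. The value must be less than the collection duration. |
-i/--interval | - | Sets the collection interval. The default is 1 second. If a collection duration is set, the interval must be less than or equal to it. |
-l/--log-level | 0/1/2/3 | Sets the log level. The default is 1. |
-L/--profile-level | 0/1/2/3/4/5/6 | Sets the metrics to analyze. The default is 0. In the examples in this section, omitting -L collects data for all dimensions, and -L 2 collects Back-End Bound->Core Bound data. |
-o/--output | - | Sets the name and output path of the report data package. If only a name is given, the package is generated in the current directory. This parameter must be used together with --package. |
-r/--collection-range | user/kernel/all | Sets the collection range. When -p/--pid is set to ALL, data can be collected for kernel-mode processes or user-mode processes. The default is all (performance data is collected in both user mode and kernel mode). |
-p/--pid | PID/PID1,PID2/ALL | Specifies the PIDs of the processes to collect. Separate multiple PIDs with commas (,). By default, all processes are collected (ALL). If both -p and -c are specified, the processes specified by PID take precedence. |
--package | - | Generates a report data package. If no package name or path is specified, top-down-<timestamp>.tar is generated in the current directory. |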
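The limits stated in the table (a duration of at least 1 second, a delay shorter than the duration, an interval no longer than the duration) can be checked before a collection is started. The following is a minimal sketch and not part of DevKit: the helper names `expand_cpu_spec` and `build_top_down_args` are hypothetical, and only flags listed in the table are used.

```python
def expand_cpu_spec(spec):
    """Expand a -c/--cpu value such as "0", "0,1,2" or "0-2" into core IDs.

    Mixed forms such as "0-2,5" are accepted by this sketch; whether the
    tool itself accepts them is not stated in the documentation.
    """
    cores = []
    for part in spec.split(","):
        if "-" in part:
            first, last = (int(x) for x in part.split("-"))
            if first > last:
                raise ValueError(f"invalid CPU range: {part}")
            cores.extend(range(first, last + 1))
        else:
            cores.append(int(part))
    return cores


def build_top_down_args(duration=None, delay=0, interval=1, cpu=None, pid=None):
    """Build an argument list for `devkit tuner top-down` after checking the documented limits."""
    if duration is not None:
        if duration < 1:
            raise ValueError("duration must be at least 1 second")
        if delay >= duration:
            raise ValueError("delay must be less than the duration")
        if interval > duration:
            raise ValueError("interval must not exceed the duration")
    args = ["devkit", "tuner", "top-down"]
    if cpu is not None:
        expand_cpu_spec(cpu)  # only validates the format; the raw value is passed on
        args += ["-c", cpu]
    if pid is not None:
        args += ["-p", str(pid)]
    if duration is not None:
        args += ["-d", str(duration)]
    if delay:
        args += ["-D", str(delay)]
    if interval != 1:
        args += ["-i", str(interval)]
    return args
```

The returned list can be passed to `subprocess.run` on a machine where DevKit is installed. Keeping the checks in one place avoids starting a run that the tool would reject.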
    devkit tuner top-down -c 0-127 -d 3 -o /home/topdown_cpu -L 2 --package
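In this command, -c 0-127 collects data on CPU cores 0 to 127, -d 3 sets the collection duration to 3 seconds, -o /home/topdown_cpu together with --package generates a report data package named topdown_cpu in the specified path, and -L 2 collects Back-End Bound->Core Bound data.
The command output is as follows: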
    TOP-DOWN Summary Report-ALL                    Time:2024/08/06 15:41:27
    =======================================================================
    Top-down metrics of the system:
    Cycles          244,796,223
    Instructions    138,949,659
    IPC             0.57
    ──────────────────────────────────────────────────────────────────
    Top-down Metrics                    Bound(%)    Preferred Sampling Event
    ──────────────────────────────────────────────────────────────────
    Bad Speculation                     15.31       --
    Frontend Bound                      40.35       fetch_bubble
    Retiring                            14.19       inst_retired
    Backend Bound                       30.15       --
    ├── Resource Bound                  4.43        --
    ├── Core Bound                      15.42       --
    │   ├── Divider Stall               0.00        --
    │   ├── FSU Stall                   0.00        --
    │   └── Exe Ports Util              15.41       --
    │       ├── ALU BRU IssueQ Full     0.61        --
    │       ├── LS IssueQ Full          1.14        --
    │       └── FSU IssueQ Full         0.00        --
    └── Memory Bound                    10.26       --
    ──────────────────────────────────────────────────────────────────
    3009 milliseconds time elapsed

    Note:
    To view the hotspot data. You can run devkit tuner hotspot -e [Preferred Sampling Event]

    The report /home/topdown_cpu.tar is generated successfully.
    To view summary report. you can run: devkit report -i /home/topdown_cpu.tar
    To view detail report. you can import the report to the WebUI or IDE to view details.
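The IPC figure in the summary is consistent with the usual definition of instructions per cycle, that is, instructions divided by cycles. The short sketch below recomputes it from the numbers in the report above and adds up the four top-level categories, which together should account for roughly 100% of the pipeline slots. The helper name `ipc` is hypothetical.

```python
def ipc(instructions, cycles):
    """Instructions per cycle, rounded to two decimals as in the report."""
    return round(instructions / cycles, 2)


# Figures copied from the system-wide report above.
print(ipc(138_949_659, 244_796_223))  # 0.57

top_level = {
    "Bad Speculation": 15.31,
    "Frontend Bound": 40.35,
    "Retiring": 14.19,
    "Backend Bound": 30.15,
}
print(round(sum(top_level.values()), 2))  # 100.0
```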
    devkit tuner top-down -p 12540 -d 3 -o /home/topdown_pid --package
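In this command, -p 12540 collects data for the process whose PID is 12540, -d 3 sets the collection duration to 3 seconds, and -o /home/topdown_pid together with --package generates a report data package named topdown_pid in the specified path. Because -L is not specified, data for all dimensions is collected.
The command output is as follows: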
    TOP-DOWN Summary Report-ALL                    Time:2024/08/06 15:48:48
    =======================================================================
    Top-down metrics of process id '1884856':
    Cycles          1,488,556,148
    Instructions    1,480,811,195
    IPC             0.99
    ──────────────────────────────────────────────────────────────────
    Top-down Metrics                    Bound(%)    Preferred Sampling Event
    ──────────────────────────────────────────────────────────────────
    Bad Speculation                     55.99       --
    ├── Branch Mispredicts              55.86       br_mis_pred
    │   ├── Indirect Branch             0.00        --
    │   ├── Push Branch                 0.00        --
    │   ├── Pop Branch                  0.00        --
    │   └── Other Branch                55.86       --
    └── Machine Clears                  0.12        --
        ├── Nuke Flush                  0.02        --
        └── Other Flush                 0.09        --
    Frontend Bound                      12.48       fetch_bubble
    ├── Fetch Latency Bound             9.42        --
    │   ├── ITLB Miss                   0.02        --
    │   │   ├── L1 Tlb                  0.02        --
    │   │   └── L2 Tlb                  0.00        l2i_tlb_refill
    │   ├── ICache Miss                 0.43        --
    │   │   ├── L1 Cache                0.11        --
    │   │   └── L2 Cache                0.32        l2i_cache_refill
    │   ├── Branch Mispredict Flush     8.91        br_mis_pred
    │   ├── OoO Flush                   0.01        --
    │   └── Static Predictor Flush      0.05        --
    └── Fetch Bandwidth Bound           3.05        --
    Retiring                            24.86       inst_retired
    Backend Bound                       6.66        --
    ├── Resource Bound                  0.07        --
    │   ├── Sync Stall                  0.00        --
    │   ├── Reorder Buffer Stall        0.00        --
    │   ├── Physical Tag Stall          0.07        --
    │   ├── SaveOp Queue Stall          0.00        --
    │   ├── PC Buffer Stall             0.00        --
    │   └── Other Stall                 0.00        --
    ├── Core Bound                      4.80        --
    │   ├── Divider Stall               0.00        --
    │   ├── FSU Stall                   0.00        --
    │   └── Exe Ports Util              4.79        --
    │       ├── ALU BRU IssueQ Full     0.03        --
    │       ├── LS IssueQ Full          0.17        --
    │       └── FSU IssueQ Full         0.00        --
    └── Memory Bound                    1.77        --
        ├── L1 Bound                    1.73        --
        ├── L2 Bound                    0.03        --
        ├── L3 or DRAM Bound            0.01        cache-misses
        └── Store Bound                 0.00        --
    ──────────────────────────────────────────────────────────────────
    3000 milliseconds time elapsed

    Note:
    To view the hotspot data. You can run devkit tuner hotspot -e [Preferred Sampling Event]

    The report /home/topdown_pid.tar is generated successfully.
    To view summary report. you can run: devkit report -i /home/topdown_pid.tar
    To view detail report. you can import the report to the WebUI or IDE to view details.
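The Preferred Sampling Event column shows the key event that affects each microarchitecture bound. Tuning for that event can reduce the corresponding bound. Use devkit tuner hotspot -e [Preferred Sampling Event] for analysis and tuning.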
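One way to act on this is to pull the rows that name a preferred sampling event out of a saved summary and turn them into `devkit tuner hotspot -e` invocations. The sketch below assumes the row layout shown in the reports above (tree prefix, metric name, percentage, then an event or `--`). `parse_rows` and `hotspot_commands` are hypothetical helper names, and no hotspot options other than `-e` are assumed.

```python
import re

# A metric row: optional tree-drawing prefix, a name, a percentage, then an event or "--".
ROW = re.compile(r"^[\s│├└─]*(?P<name>.+?)\s+(?P<bound>\d+\.\d+)\s+(?P<event>\S+)\s*$")


def parse_rows(report_text):
    """Return (metric, bound_percent, event_or_None) for every metric row in a summary."""
    rows = []
    for line in report_text.splitlines():
        match = ROW.match(line)
        if match:
            event = match.group("event")
            rows.append((
                match.group("name"),
                float(match.group("bound")),
                None if event == "--" else event,
            ))
    return rows


def hotspot_commands(report_text):
    """Build one hotspot command per distinct preferred event, highest bound first."""
    highest = {}
    for _name, bound, event in parse_rows(report_text):
        if event is not None:
            highest[event] = max(bound, highest.get(event, 0.0))
    ordered = sorted(highest, key=highest.get, reverse=True)
    return [f"devkit tuner hotspot -e {event}" for event in ordered]
```

Note that a high Retiring share is usually a good sign, so the `inst_retired` entry is listed for completeness rather than as a problem to fix.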
    devkit tuner top-down -d 10 -o /home/topdown_app -L 2 --package /opt/testdemo/topdown_suggest
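This command collects data from the application /opt/testdemo/topdown_suggest for 10 seconds. -o /home/topdown_app together with --package generates a report data package named topdown_app in the specified path, and -L 2 collects Back-End Bound->Core Bound data.
The command output is as follows: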
    TOP-DOWN Summary Report-ALL                    Time:2024/12/25 17:45:45
    =======================================================================
    Top-down metrics of /opt/testdemo/topdown_suggest:
    Cycles          25,998,441,500
    Instructions    27,700,695,734
    IPC             1.07
    ──────────────────────────────────────────────────────────────────
    Top-down Metrics                    Bound(%)    Preferred Sampling Event
    ──────────────────────────────────────────────────────────────────
    Bad Speculation                     0.01        --
    Frontend Bound                      25.94       fetch_bubble
    Retiring                            26.63       inst_retired
    Backend Bound                       47.42       --
    ├── Resource Bound                  7.74        --
    ├── Core Bound                      31.47       --
    │   ├── Divider Stall               0.00        --
    │   ├── FSU Stall                   0.00        --
    │   └── Exe Ports Util              31.46       --
    │       ├── ALU BRU IssueQ Full     15.67       --
    │       ├── LS IssueQ Full          3.11        --
    │       └── FSU IssueQ Full         0.00        --
    └── Memory Bound                    8.20        --
    ──────────────────────────────────────────────────────────────────
    10000 milliseconds time elapsed

    Note:
    To view the hotspot data. You can run devkit tuner hotspot -e [Preferred Sampling Event]

    Optimization Suggestions
    1. The percentage of Frontend Bound is high.(Threshold: 20.00%)
    Take the following optimization measures for C/C++ applications compiled using the BiSheng compiler. For other compilers, you can refer to the optimization suggestions. Verify the optimization suggestions in your specific application scenario.
    (1) Set the Inline parameter: -mllvm -inline-threshold=1550 (1550 is an optimal empirical value). You are advised to enable LTO in advance.
    (2) Adjust the alignment of functions, that of basic blocks, and that of basic blocks without jumping: -mllvm -align-all-functions=2^n -mllvm -align-all-blocks=2^n -mllvm -align-all-nofallthru-blocks=2^n, where the value of 2^n is 32 or 64 and can be changed to 16 or 128 if needed.
    (3) Enable PGO: -mllvm -enable-split-machine-functions. After PGO is enabled, the compiler splits functions based on the popularity of basic blocks and adjusts the code block layout to optimize the program performance.
    (4) Enable LTO. There are two types of LTO: full and thin, which correspond to -flto=full and -flto=thin. Full LTO delivers superior performance but requires a longer compilation time. Thin LTO delivers inferior performance but needs a shorter compilation time. To enable LTO for the compilation, add a link time option to the optimization options of 1) to 3). For example, for -mllvm -enable-split-machine-functions, prefix it with -fuse-ld=lld -Wl, that is, -fuse-ld=lld -Wl,-mllvm,-enable-split-machine-functions.
    2. The percentage of Backend Bound is high.(Threshold: 20.00%)
    Take the following optimization measures for C/C++ applications compiled using the BiSheng compiler. For other compilers, you can refer to the optimization suggestions. Verify the optimization suggestions in your specific application scenario.
    (1) Use the jemalloc library. Associate the libjemalloc.so soft link in the lib directory of the BiSheng compiler with the jemalloc dynamic library entity whose size is the same as the system page table size in this directory, and add the -ljemalloc parameter for the compilation.
    (2) Set Wrap-memset/memcpy: Wl,-wrap=memset/memcpy -lstringlib. The BiSheng compiler provides memset/memcpy implementation in the libstring library, which is more adaptable to the AArch64 architecture. When the glibc version is earlier and the function proportion is high, the performance is significantly improved.
    (3) Set prefetch to save the data to be accessed to the cache, so as to reduce the value of d-cache miss. The hardware has its own prefetch mechanism. The compiler supports the software prefetch function. When tsv110 is enabled, the BiSheng compiler automatically enables software prefetch. You can adjust the prefetch density by using the three parameters: -mllvm -prefetch-loop-depth=x -mllvm -min-prefetch-stride=y -mllvm -prefetch-distance=z, where for example, x=3, y=9, z=940.
    (4) Add the -fstack-arrays parameter to place all arrays onto the stack. The parameter takes effect only on Fortran.
    (5) Try enabling huge pages.

    The report /home/topdown_app.tar is generated successfully.
    To view summary report. you can run: devkit report -i /home/topdown_app.tar
    To view detail report. you can import the report to the WebUI or IDE to view details.
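The output above is the summary of the microarchitecture analysis task. Use the --package parameter to generate a TAR package, then import it to the web UI to view the results graphically. For details about importing, see the task import part of Task Management.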
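When scripting the import step, it helps to know where the TAR package will land. The sketch below encodes the behavior seen in the examples, where `-o /home/topdown_cpu` produced `/home/topdown_cpu.tar` and a bare name is placed in the current directory. The function name `expected_report_path` is hypothetical, and the assumption that `.tar` is always appended to the `-o` value is taken from those examples only.

```python
from pathlib import PurePosixPath


def expected_report_path(output, cwd):
    """Predict the TAR package path for a given -o value used with --package."""
    target = PurePosixPath(output)
    if not target.is_absolute():
        # A bare name (or relative path) is resolved against the current directory.
        target = PurePosixPath(cwd) / target
    # Assumption from the examples: the tool appends ".tar" to the -o value.
    return target.with_name(target.name + ".tar")


print(expected_report_path("/home/topdown_cpu", "/root"))  # /home/topdown_cpu.tar
```

The resulting path can then be passed to `devkit report -i`, as the tool output suggests.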