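Microarchitecture Analysis
Based on ARM PMU (Performance Monitor Unit) events, this function shows how instructions run in the CPU pipeline, so that users can modify their programs in a targeted way and make full use of the available hardware resources.
For a sample tutorial, see Table 1.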
Command Function
Analyzes how instructions run in the CPU pipeline, based on ARM PMU events.
Command Format
devkit tuner top-down [-h] [-c {n | n,m | n-m}] [-d <sec>] [-D <sec>] [-l {0, 1, 2, 3}] [-L {0, 1, 2, 3, 4, 5, 6}] [-i <sec>] [-p {PID | PID1,PID2 | ALL}] [-r {user, kernel, all}] [-G cgroup_name] [workload workload...]
To collect data for a specific application, run devkit tuner top-down [workload workload...] and replace [workload workload...] with the application path and its arguments. In a single collection, at most one of "-c", "-p", "-G", or an application may be specified.
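The rule that at most one collection target may be given can be checked before the tool is launched. The following is a minimal Python sketch under stated assumptions: the helper name `build_top_down_command` and its keyword arguments are hypothetical and not part of DevKit, and only flags that appear in the command format above are emitted.

```python
from typing import List, Optional, Sequence


def build_top_down_command(
    cpu: Optional[str] = None,
    pid: Optional[str] = None,
    cgroup: Optional[str] = None,
    workload: Optional[Sequence[str]] = None,
    duration: Optional[int] = None,
) -> List[str]:
    """Build an argv list for `devkit tuner top-down` (hypothetical helper)."""
    # The documentation allows at most one of -c, -p, -G, or a workload.
    targets = {"-c": cpu, "-p": pid, "-G": cgroup, "workload": workload}
    chosen = [name for name, value in targets.items() if value]
    if len(chosen) > 1:
        raise ValueError("specify at most one of: " + ", ".join(chosen))

    argv = ["devkit", "tuner", "top-down"]
    if duration is not None:
        argv += ["-d", str(duration)]
    if cpu:
        argv += ["-c", cpu]
    if pid:
        argv += ["-p", pid]
    if cgroup:
        argv += ["-G", cgroup]
    if workload:
        argv += list(workload)  # application path followed by its arguments
    return argv


print(build_top_down_command(cpu="0-127", duration=3))
```

The returned list can be passed to `subprocess.run` on a machine where DevKit is installed; the sketch itself only assembles and validates the arguments.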
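Parameter Description
| Parameter | Option | Description |
|---|---|---|
| -h/--help | - | Optional. Displays help information. |
| -c/--cpu | - | Optional. Specifies the CPU core numbers to collect, for example "0", "0,1,2", or "0-2". |
| -d/--duration | - | Optional. Sets the collection duration in seconds. The minimum value is 1. By default, collection runs until it is stopped: press Ctrl+\ to cancel the task, or press Ctrl+C to stop collection and start the analysis. |
| -D/--delay | - | Optional. Sets the delay before collection starts, in seconds. The default is 0. The value must be less than the collection duration. |
| -l/--log-level | 0/1/2/3 | Optional. Sets the log level. The default is 1. |
| -L/--profile-level | 0/1/2/3/4/5/6 | Optional. Sets the analysis metrics. The default is 0. |
| -i/--interval | - | Optional. Sets the collection interval in seconds. The default is 1. If a collection duration is set, the interval must be less than or equal to it. |
| -p/--pid | - | Optional. Specifies the process to collect. |
| -r/--collection-range | user/kernel/all | Optional. Sets the collection range for processes. When "-p"/"--pid" is set to ALL, either kernel-mode or user-mode processes can be collected. The default is all (collects both user-mode and kernel-mode performance data). |
| -G/--cgroup | - | Optional. Specifies the cgroup to collect. A cgroup monitors and manages resources for a group of processes. Currently only cgroup v1 and cgroup v2 are supported. |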
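The value formats and constraints in the table can be expressed as small checks. The sketch below is illustrative only: `parse_cpu_spec` and `check_timing` are hypothetical helper names, and accepting a mixed value such as "0-2,5" is an assumption, since the table lists only the three forms "0", "0,1,2", and "0-2".

```python
from typing import List, Optional


def parse_cpu_spec(spec: str) -> List[int]:
    """Expand a -c value such as "0", "0,1,2" or "0-2" into core numbers."""
    cores = set()
    for part in spec.split(","):
        if "-" in part:
            first, last = (int(x) for x in part.split("-", 1))
            if first > last:
                raise ValueError(f"invalid range: {part}")
            cores.update(range(first, last + 1))
        else:
            cores.add(int(part))
    return sorted(cores)


def check_timing(duration: Optional[int], delay: int = 0, interval: int = 1) -> None:
    """Check the documented -d/-D/-i rules; duration=None means 'until stopped'."""
    if duration is None:
        return  # the delay and interval limits are stated relative to a set duration
    if duration < 1:
        raise ValueError("duration must be at least 1 second")
    if delay >= duration:
        raise ValueError("delay must be less than the duration")
    if interval > duration:
        raise ValueError("interval must not exceed the duration")


print(parse_cpu_spec("0-2"))
check_timing(duration=3, delay=0, interval=1)
```

The defaults of `check_timing` mirror the documented defaults (delay 0, interval 1).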
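Examples
- Collect by CPU.
devkit tuner top-down -c 0-127 -d 3 -L 2
-c 0-127 collects CPU cores 0 to 127, -d 3 sets the collection duration to 3 seconds, and -L 2 collects the Back-End Bound->Core Bound instruction data.
The following information is returned: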
================================================================================
Version : DevKit xxx
CPU Model :xxx
Command : devkit tuner top-down -c 0-127 -d 3 -L 2
================================================================================
TOP-DOWN Summary Report-ALL    Time:2026/02/03 19:05:16
=======================================================================
Top-down metrics of CPU(s) 0-127:
Cycles          408,642,602,711
Instructions    347,968,271,194
IPC             0.85
────────────────────────────────────────────────────────────────────────────────
Top-down Metrics                  Bound(%)   Preferred Sampling Event
────────────────────────────────────────────────────────────────────────────────
Bad Speculation                       0.16   --
Frontend Bound                        2.05   --
Retiring                             14.19   inst_retired
Backend Bound                        83.59   --
├── Core Bound                       36.21   --
│   ├── FDIV Stall                    0.00   --
│   ├── DIV Stall                     0.00   --
│   ├── FSU Stall                     1.12   --
│   ├── Resource Bound*              13.39   --
│   │   ├── Rob_stall*                0.48   --
│   │   ├── Ptag_stall*               7.18   --
│   │   ├── MapQ_stall*               5.72   --
│   │   ├── PCBuf_stall*              0.01   --
│   │   └── Other_stall*              0.00   --
│   └── Exe Ports Util               21.70   --
│       ├── 0 ports serialize         0.47   --
│       ├── 0 ports non serialize    13.99   --
│       ├── 1 ports                   0.81   --
│       ├── 2 ports                   0.33   --
│       ├── 3 ports                   0.64   --
│       ├── 4 ports                   5.40   --
│       ├── 5 ports                   0.04   --
│       └── 6p ports                  0.02   --
└── Memory Bound                     47.38   --
────────────────────────────────────────────────────────────────────────────────
────────────────────────────────────────────────────────────────────────────────
PMU Event Count
────────────────────────────────────────────────────────────────────────────────
r0008  347,968,271,194
r0011  408,642,602,711
r001b  351,964,762,921
r2004  4,757,141,798
r2005  91,300,454
r2006  9,758,187,924
r2007  18,186
r2008  62,112,994,434
r2009  0
r200a  164
r200b  4,558,286,815
r200c  35,479,858,717
r200d  17,243,622,831
r2011  50,301,733,180
r7000  98,402,039,920
r7001  387,268,814,027
r7002  1,216,970
r7003  27,917,649
r7004  8,211,072,317
r7005  219,520,201,751
r7006  1,234,618
r700a  8,177,731,287
r700b  241,370,982,829
r700c  13,916,833,884
r700d  5,613,201,615
r700e  11,099,720,734
r700f  932,53,410,476
r7010  688,907,852
r7011  345,655,294
────────────────────────────────────────────────────────────────────────────────
3083 milliseconds time elapsed
Metrics marked with '*' indicate approximate values.
Note: To view the hotspot data. You can run devkit tuner hotspot -e [Preferred Sampling Event]
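The summary figures in this report can be cross-checked with simple arithmetic. The sketch below assumes that IPC is Instructions divided by Cycles; the report does not state its formula, but the ratio reproduces the printed value. The function name `instructions_per_cycle` is illustrative. In this report, the r0008 and r0011 counts equal the Instructions and Cycles lines, which is consistent with the ARM architectural events INST_RETIRED (0x08) and CPU_CYCLES (0x11).

```python
def instructions_per_cycle(instructions: int, cycles: int) -> float:
    """Return instructions divided by cycles (assumed definition of IPC)."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles


# Values copied from the CPU 0-127 report above.
cycles = 408_642_602_711
instructions = 347_968_271_194
top_level = {
    "Bad Speculation": 0.16,
    "Frontend Bound": 2.05,
    "Retiring": 14.19,
    "Backend Bound": 83.59,
}

print(f"IPC {instructions_per_cycle(instructions, cycles):.2f}")
print(f"Top-level total {sum(top_level.values()):.2f}%")
```

This prints an IPC of 0.85, matching the report, and a top-level total of 99.99%, so the four top-level categories account for all pipeline slots up to rounding.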
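- Collect by process ID.
devkit tuner top-down -p 3716829 -d 3
-p 3716829 collects the process whose PID is 3716829, and -d 3 sets the collection duration to 3 seconds. Because "-L" is not specified, data for all dimensions is collected.
The following information is returned: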
================================================================================
Version : DevKit xxx
Command : devkit tuner top-down -p 3716829 -d 3
================================================================================
TOP-DOWN Summary Report-ALL    Time:2026/02/03 19:11:04
=======================================================================
Top-down metrics of process id '3716829':
Cycles          565,161,956,085
Instructions    890,812,766,868
IPC             1.58
────────────────────────────────────────────────────────────────────────────────
Top-down Metrics                  Bound(%)   Preferred Sampling Event
────────────────────────────────────────────────────────────────────────────────
Bad Speculation                       0.00   --
├── Branch Mispredicts                0.00   br_mis_pred
│   ├── Indirect Branch               0.00   --
│   ├── Push Branch                   0.00   --
│   ├── Pop Branch                    0.00   --
│   └── Other Branch                  0.00   --
└── Machine Clears                    0.00   --
    ├── Nuke Flush                    0.00   --
    └── Other Flush                   0.00   --
Frontend Bound                        0.87   --
├── Fetch Latency Bound               0.74   --
│   ├── ITLB Miss                     0.06   --
│   ├── ICache Miss                   0.62   --
│   ├── BP_Misp_Flush                 0.03   br_mis_pred
│   ├── OoO Flush                     0.01   --
│   └── Static Predictor Flush        0.03   --
└── Fetch Bandwidth Bound             0.13   --
Retiring                             26.27   inst_retired
Backend Bound                        72.86   --
├── Core Bound                       33.15   --
│   ├── FDIV Stall                    0.00   --
│   ├── DIV Stall                     0.00   --
│   ├── FSU Stall                     0.84   --
│   ├── Resource Bound*              11.94   --
│   │   ├── Rob_stall*                0.15   --
│   │   ├── Ptag_stall*               6.40   --
│   │   ├── MapQ_stall*               5.39   --
│   │   ├── PCBuf_stall*              0.00   --
│   │   └── Other_stall*              0.00   --
│   └── Exe Ports Util               20.37   --
│       ├── 0 ports serialize         0.28   --
│       ├── 0 ports non serialize    12.10   --
│       ├── 1 ports                   0.62   --
│       ├── 2 ports                   0.29   --
│       ├── 3 ports                   0.54   --
│       ├── 4 ports                   6.49   --
│       ├── 5 ports                   0.04   --
│       └── 6p ports                  0.02   --
└── Memory Bound                     39.70   --
    ├── L1 Bound                      3.51   --
    │   ├── DTLB                      0.18   --
    │   ├── Misalign                  0.53   --
    │   ├── Resource Full             0.00   --
    │   ├── Instruction Type          0.14   --
    │   ├── Forward hazard            0.15   --
    │   ├── Structure hazard          1.77   --
    │   └── Pipeline                  0.74   --
    ├── L2 Bound                      0.00   --
    │   ├── buffer pending            0.00   --
    │   ├── snoop pending             0.00   --
    │   ├── Arb idle                  0.00   --
    │   └── Pipeline                  0.00   --
    ├── L3 or DRAM Bound             36.20   --
    └── Store Bound                   0.00   --
        ├── SCA                       0.00   --
        ├── Head                      0.00   --
        ├── Order                     0.00   --
        └── Other                     0.00   --
────────────────────────────────────────────────────────────────────────────────
────────────────────────────────────────────────────────────────────────────────
PMU Event Count
────────────────────────────────────────────────────────────────────────────────
r0008  890,812,766,869
r0010  42,361,366
r0011  565,161,956,085
r001b  885,884,852,999
r0027  120,670,847
r0028  29,018,877
r002e  647,664
r0030  24,644,752
r100d  5,983,845
r1010  8,632,549
r1013  16,476
r1016  123,653
r104f  34,432,517
r2004  2,603,284,473
r2005  69,323,025
r2006  5,786,237,013
r2007  29,311
r2008  109,031,479,289
r2009  14
r200a  0
r200b  6,304,758,864
r200c  50,709,639,660
r200d  39,671,785,098
r200f  4,573,402
r2010  6,060,229
r2011  29,618,727,432
r2012  4,204,799,697
r5090  498,155,819
r5091  1,505,244,430
r5092  1,397,778
r5093  399,164,697
r5094  432,410,165
r5095  5,041,495,853
r5096  2,097,655,733
r50a0  159,236,956
r50a2  17,038,846,820
r50a3  256,623,465,481
r50a4  245,022,246
r7000  143,896,308,625
r7001  561,721,629,213
r7002  1,740,417
r7003  25,562,410
r7004  10,080,452,487
r7005  306,104,360,719
r7006  5,399,535
r7007  278,926,699,807
r7008  280,522,927,806
r700a  7,776,571,030
r700b  341,315,121,543
r700c  17,521,518,047
r700d  8,248,288,096
r700e  15,281,449,842
r700f  183,091,541,693
r7010  1,132,511,038
r7011  558,813,174
r701e  6,807,528
r701f  95,045,849
r7020  960,519,537
────────────────────────────────────────────────────────────────────────────────
3378 milliseconds time elapsed
Metrics marked with '*' indicate approximate values.
Note: To view the hotspot data. You can run devkit tuner hotspot -e [Preferred Sampling Event]
Optimization Suggestions
1. The percentage of Backend Bound is high.(Threshold: 20.00%)
Take the following optimization measures for C/C++ applications compiled using the BiSheng compiler. For other compilers, you can refer to the optimization suggestions. Verify the optimization suggestions in your specific application scenario.
(1) Use the jemalloc library. Associate the libjemalloc.so soft link in the lib directory of the BiSheng compiler with the jemalloc dynamic library entity whose size is the same as the system page table size in this directory, and add the -ljemalloc parameter for the compilation.
(2) Set Wrap-memset/memcpy: Wl,-wrap=memset/memcpy -lstringlib. The BiSheng compiler provides memset/memcpy implementation in the libstring library, which is more adaptable to the AArch64 architecture. When the glibc version is earlier and the function proportion is high, the performance is significantly improved.
(3) Set prefetch to save the data to be accessed to the cache, so as to reduce the value of d-cache miss. The hardware has its own prefetch mechanism. The compiler supports the software prefetch function. When tsv110 is enabled, the BiSheng compiler automatically enables software prefetch. You can adjust the prefetch density by using the three parameters: -mllvm -prefetch-loop-depth=x -mllvm -min-prefetch-stride=y -mllvm -prefetch-distance=z, where for example, x=3, y=9, z=940.
(4) Add the -fstack-arrays parameter to place all arrays onto the stack. The parameter takes effect only on Fortran.
(5) Try enabling huge pages.
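The Preferred Sampling Event column lists the key events that affect each microarchitecture bound. Tuning for a key event helps reduce the corresponding bound. Run devkit tuner hotspot -e [Preferred Sampling Event] to analyze and tune for that event.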
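Rows that carry a Preferred Sampling Event can be picked out of a saved report mechanically. The sketch below is a rough illustration, assuming each metric row consists of an optional tree prefix, a metric name, a percentage, and an event name (or `--`) separated by whitespace; the exact column spacing of the real output is not specified here, and `parse_metric_line` and `hotspot_commands` are hypothetical helper names.

```python
from typing import List, Tuple

TREE_CHARS = "│├└─ "  # tree-drawing characters used in the report rows


def parse_metric_line(line: str) -> Tuple[str, float, str]:
    """Split one report row into (metric name, bound %, preferred event)."""
    body = line.lstrip(TREE_CHARS)
    # The last two whitespace-separated fields are the value and the event.
    name, value, event = body.rsplit(None, 2)
    return name.strip(), float(value), event


def hotspot_commands(lines: List[str]) -> List[str]:
    """Return one hotspot command per distinct preferred sampling event."""
    commands: List[str] = []
    for line in lines:
        _, _, event = parse_metric_line(line)
        command = f"devkit tuner hotspot -e {event}"
        if event != "--" and command not in commands:
            commands.append(command)
    return commands


# Rows excerpted from the process report above.
rows = [
    "Bad Speculation 0.00 --",
    "├── Branch Mispredicts 0.00 br_mis_pred",
    "│   ├── BP_Misp_Flush 0.03 br_mis_pred",
    "Retiring 26.27 inst_retired",
    "│   ├── Resource Bound* 11.94 --",
]
for command in hotspot_commands(rows):
    print(command)
```

For these rows the sketch prints one command for br_mis_pred and one for inst_retired; duplicates are dropped because several metrics can share the same event.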
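- Collect by application.
devkit tuner top-down -d 10 -L 2 /opt/testdemo/cache_miss_long
-d 10 sets the collection duration to 10 seconds, and -L 2 collects the Back-End Bound->Core Bound instruction data.
The following information is returned: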
================================================================================
Version : DevKit xxx
Command : devkit tuner top-down -d 10 -L 2 /opt/testdemo/cache_miss_long
================================================================================
TOP-DOWN Summary Report-ALL    Time:2026/02/03 19:14:04
=======================================================================
Top-down metrics of /opt/testdemo/cache_miss_long:
Cycles          28,931,970,351
Instructions    12,298,232,508
IPC             0.43
────────────────────────────────────────────────────────────────────────────────
Top-down Metrics                  Bound(%)   Preferred Sampling Event
────────────────────────────────────────────────────────────────────────────────
Bad Speculation                       0.22   --
Frontend Bound                        1.52   --
Retiring                              7.08   inst_retired
Backend Bound                        91.18   --
├── Core Bound                       31.48   --
│   ├── FDIV Stall                    0.00   --
│   ├── DIV Stall                     0.00   --
│   ├── FSU Stall                     0.00   --
│   ├── Resource Bound*              20.10   --
│   │   ├── Rob_stall*                0.04   --
│   │   ├── Ptag_stall*              18.33   --
│   │   ├── MapQ_stall*               1.73   --
│   │   ├── PCBuf_stall*              0.00   --
│   │   └── Other_stall*              0.00   --
│   └── Exe Ports Util               11.38   --
│       ├── 0 ports serialize         0.16   --
│       ├── 0 ports non serialize     8.04   --
│       ├── 1 ports                   1.54   --
│       ├── 2 ports                   0.87   --
│       ├── 3 ports                   0.45   --
│       ├── 4 ports                   0.21   --
│       ├── 5 ports                   0.07   --
│       └── 6p ports                  0.03   --
└── Memory Bound                     59.70   --
────────────────────────────────────────────────────────────────────────────────
────────────────────────────────────────────────────────────────────────────────
PMU Event Count
────────────────────────────────────────────────────────────────────────────────
r0008  12,298,232,508
r0011  28,931,970,351
r001b  12,675,652,675
r2004  51,385,954
r2005  5,877,420
r2006  23,237,998,292
r2007  2,054
r2008  0
r2009  0
r200a  0
r200b  2,188,478,662
r200c  0
r200d  0
r2011  2,638,598,816
r7000  17,589,179,668
r7001  28,842,703,274
r7002  0
r7003  151,662
r7004  0
r7005  18,885,819,361
r7006  0
r700a  417,655,851
r700b  20,454,124,794
r700c  3,923,354,625
r700d  2,222,293,173
r700e  1,147,598,469
r700f  523,512,233
r7010  182,140,658
r7011  87,297,422
────────────────────────────────────────────────────────────────────────────────
10003 milliseconds time elapsed
Metrics marked with '*' indicate approximate values.
Note: To view the hotspot data. You can run devkit tuner hotspot -e [Preferred Sampling Event]
Optimization Suggestions
1. The percentage of Backend Bound is high.(Threshold: 20.00%)
Take the following optimization measures for C/C++ applications compiled using the BiSheng compiler. For other compilers, you can refer to the optimization suggestions. Verify the optimization suggestions in your specific application scenario.
(1) Use the jemalloc library. Associate the libjemalloc.so soft link in the lib directory of the BiSheng compiler with the jemalloc dynamic library entity whose size is the same as the system page table size in this directory, and add the -ljemalloc parameter for the compilation.
(2) Set Wrap-memset/memcpy: Wl,-wrap=memset/memcpy -lstringlib. The BiSheng compiler provides memset/memcpy implementation in the libstring library, which is more adaptable to the AArch64 architecture. When the glibc version is earlier and the function proportion is high, the performance is significantly improved.
(3) Set prefetch to save the data to be accessed to the cache, so as to reduce the value of d-cache miss. The hardware has its own prefetch mechanism. The compiler supports the software prefetch function. When tsv110 is enabled, the BiSheng compiler automatically enables software prefetch. You can adjust the prefetch density by using the three parameters: -mllvm -prefetch-loop-depth=x -mllvm -min-prefetch-stride=y -mllvm -prefetch-distance=z, where for example, x=3, y=9, z=940.
(4) Add the -fstack-arrays parameter to place all arrays onto the stack. The parameter takes effect only on Fortran.
(5) Try enabling huge pages.
- Collect by cgroup.
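devkit tuner top-down -d 10 -L 2 -G my_test_cgroup
-d 10 sets the collection duration to 10 seconds, -L 2 collects the Back-End Bound->Core Bound instruction data, and -G my_test_cgroup collects the cgroup named my_test_cgroup.
The following information is returned: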
================================================================================
Version : DevKit xxx
Command : devkit tuner top-down -d 10 -L 2 -G my_test_cgroup
================================================================================
TOP-DOWN Summary Report-ALL    Time:2026/02/03 19:20:10
=======================================================================
Top-down metrics of cgroup: my_test_cgroup:
Cycles          28,852,354,047
Instructions    29,575,950,273
IPC             1.03
────────────────────────────────────────────────────────────────────────────────
Top-down Metrics                  Bound(%)   Preferred Sampling Event
────────────────────────────────────────────────────────────────────────────────
Bad Speculation                      47.60   --
Frontend Bound                        9.24   --
Retiring                             17.08   inst_retired
Backend Bound                        26.07   --
├── Core Bound                       18.37   --
│   ├── FDIV Stall                    0.00   --
│   ├── DIV Stall                     0.00   --
│   ├── FSU Stall                     0.00   --
│   ├── Resource Bound*               0.97   --
│   │   ├── Rob_stall*                0.00   --
│   │   ├── Ptag_stall*               0.90   --
│   │   ├── MapQ_stall*               0.08   --
│   │   ├── PCBuf_stall*              0.00   --
│   │   └── Other_stall*              0.00   --
│   └── Exe Ports Util               17.40   --
│       ├── 0 ports serialize         0.02   --
│       ├── 0 ports non serialize     4.40   --
│       ├── 1 ports                   0.15   --
│       ├── 2 ports                   0.81   --
│       ├── 3 ports                   2.24   --
│       ├── 4 ports                   3.56   --
│       ├── 5 ports                   3.47   --
│       └── 6p ports                  2.74   --
└── Memory Bound                      7.70   --
────────────────────────────────────────────────────────────────────────────────
────────────────────────────────────────────────────────────────────────────────
PMU Event Count
────────────────────────────────────────────────────────────────────────────────
r0008  29,575,950,273
r0011  28,852,354,047
r001b  111,983,642,709
r2004  4,947,993
r2005  297,829
r2006  3,396,949,558
r2007  291
r2008  0
r2009  0
r200a  0
r200b  284,335,306
r200c  0
r200d  7,780
r2011  16,002,028,769
r7000  958,854,053
r7001  24,280,047,948
r7002  158
r7003  153,511
r7004  1,838,145
r7005  7,167,944,269
r7006  0
r700a  31,687,641
r700b  7,243,637,081
r700c  246,613,685
r700d  1,332,389,662
r700e  3,690,316,653
r700f  5,861,896,332
r7010  5,718,837,368
r7011  4,515,603,055
────────────────────────────────────────────────────────────────────────────────
10118 milliseconds time elapsed
Metrics marked with '*' indicate approximate values.
Note: To view the hotspot data. You can run devkit tuner hotspot -e [Preferred Sampling Event]
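In the sample reports, each parent metric equals the sum of its children up to rounding, and the four top-level categories add up to roughly 100%. This is an observation from the data shown here, not a documented guarantee. A minimal Python sketch of that consistency check on the cgroup report follows; the function name `sum_mismatches` and the 0.05 tolerance are assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple


def sum_mismatches(
    parents: Dict[str, Tuple[float, List[float]]], tolerance: float = 0.05
) -> List[str]:
    """Return the parents whose value differs from the sum of their children."""
    return [
        name
        for name, (total, children) in parents.items()
        if abs(total - sum(children)) > tolerance
    ]


# Values copied from the cgroup report above: (parent value, child values).
cgroup_report = {
    "Top level (expected 100)": (100.0, [47.60, 9.24, 17.08, 26.07]),
    "Backend Bound": (26.07, [18.37, 7.70]),
    "Core Bound": (18.37, [0.00, 0.00, 0.00, 0.97, 17.40]),
    "Resource Bound*": (0.97, [0.00, 0.90, 0.08, 0.00, 0.00]),
    "Exe Ports Util": (17.40, [0.02, 4.40, 0.15, 0.81, 2.24, 3.56, 3.47, 2.74]),
}

print(sum_mismatches(cgroup_report))
```

An empty list means every level is consistent within the tolerance. A non-empty result on a transcribed report usually points to a copy error rather than to a tool problem.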