Microarchitecture Analysis
Microarchitecture analysis uses Arm performance monitor unit (PMU) events to show how instructions run on the CPU pipeline, so that you can modify your application to make full use of the hardware resources.
Table 1 provides the tutorial.
Command Function
Analyzes the running status of instructions on the CPU pipeline based on Arm PMU events, helping quickly locate performance bottlenecks of the current application on the CPUs.
Syntax
devkit tuner top-down [-h] [-c {n | n,m | n-m}] [-d <sec>] [-D <sec>] [-l {0, 1, 2, 3}] [-L {0, 1, 2, 3, 4, 5, 6}] [-i <sec>] [-p {PID | PID1,PID2 | ALL}] [-r {user, kernel, all}] [-G cgroup_name] [workload workload...]
devkit tuner top-down [workload workload...] can be used to collect data of a specified application. Replace [workload workload...] in the command with the application path and application parameter. Only one of -c, -p, -G, and the application parameter can be specified.
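The "only one target" rule can be checked before the tool is launched. The following is a minimal sketch, not part of DevKit: the helper name `build_top_down_command` and its keyword arguments are hypothetical, and only the flags listed in the syntax above are used. The text does not say what happens when no target is given, so the sketch rejects only the case of more than one.

```python
import shlex


def build_top_down_command(cpu=None, pid=None, cgroup=None, workload=None,
                           duration=None, profile_level=None):
    """Return the argv list for one `devkit tuner top-down` run.

    Hypothetical helper: at most one of cpu, pid, cgroup, and workload
    may be given, mirroring the rule stated in the Syntax section.
    """
    targets = {"-c": cpu, "-p": pid, "-G": cgroup, "workload": workload}
    chosen = [name for name, value in targets.items() if value is not None]
    if len(chosen) > 1:
        raise ValueError("only one collection target is allowed, got: "
                         + ", ".join(chosen))

    argv = ["devkit", "tuner", "top-down"]
    if duration is not None:
        argv += ["-d", str(duration)]
    if profile_level is not None:
        argv += ["-L", str(profile_level)]
    if cpu is not None:
        argv += ["-c", str(cpu)]
    if pid is not None:
        argv += ["-p", str(pid)]
    if cgroup is not None:
        argv += ["-G", str(cgroup)]
    if workload is not None:
        # Application path first, then its own parameters.
        argv += [str(part) for part in workload]
    return argv


# Prints: devkit tuner top-down -d 10 -L 2 -G my_test_cgroup
print(shlex.join(build_top_down_command(cgroup="my_test_cgroup",
                                        duration=10, profile_level=2)))
```

Returning a list rather than a string keeps the result safe to pass to `subprocess.run` without shell quoting issues.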
Parameter Description
| Parameter | Option | Description |
|---|---|---|
| -h/--help | - | Obtains help information. This parameter is optional. |
| -c/--cpu | - | Numbers of the CPU cores to collect data from, for example, 0, 0,1,2, or 0-2. This parameter is optional. |
| -d/--duration | - | Collection duration, in seconds. The minimum value is 1. By default, collection continues until you stop it. Press Ctrl+\ to cancel the task, or press Ctrl+C to stop the collection and start the analysis. This parameter is optional. |
| -D/--delay | - | Collection delay, in seconds. The default value is 0. The value must be less than the collection duration. This parameter is optional. |
| -l/--log-level | 0/1/2/3 | Log level. The default value is 1. This parameter is optional. |
| -L/--profile-level | 0/1/2/3/4/5/6 | Analysis metric. The default value is 0. This parameter is optional. |
| -i/--interval | - | Collection interval, in seconds. The default value is 1. If a collection duration is set, the interval must be less than or equal to it. This parameter is optional. |
| -p/--pid | - | ID of the process to collect data from. Separate multiple PIDs with commas (,). This parameter is optional. |
| -r/--collection-range | user/kernel/all | Collection range. The default value is all, which collects both user-mode and kernel-mode performance data. When -p/--pid is set to ALL, you can select user or kernel to collect only user-mode or only kernel-mode processes. This parameter is optional. |
| -G/--cgroup | - | Monitors the specified process group and manages its resources. Only cgroup v1 and cgroup v2 are supported. |
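The constraints in the table (a minimum duration of 1 second, a delay shorter than the duration, an interval no longer than the duration) and the three -c forms can be sketched as simple checks. The function names `parse_cpu_spec` and `check_timing` are hypothetical and are not part of DevKit. Accepting a mix such as 0-2,5 is an assumption of this sketch, because the syntax lists only n, n,m, and n-m.

```python
def parse_cpu_spec(spec):
    """Expand a -c value such as "0", "0,1,2" or "0-2" into core numbers."""
    cores = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            first, last = (int(number) for number in part.split("-", 1))
            if first > last:
                raise ValueError(f"descending range: {part}")
            cores.update(range(first, last + 1))
        else:
            cores.add(int(part))
    return sorted(cores)


def check_timing(duration=None, delay=0, interval=1):
    """Return the constraints from the parameter table that are violated.

    The defaults (delay 0, interval 1, no duration) follow the table.
    The delay and interval limits are checked only when a duration is set.
    """
    problems = []
    if duration is not None:
        if duration < 1:
            problems.append("duration must be at least 1 second")
        if delay >= duration:
            problems.append("delay must be less than duration")
        if interval > duration:
            problems.append("interval must be less than or equal to duration")
    return problems


print(parse_cpu_spec("0-2"))                 # [0, 1, 2]
print(check_timing(duration=3, delay=3))     # ['delay must be less than duration']
```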
Example
- Collect CPU data.
devkit tuner top-down -c 0-127 -d 3 -L 2
The -c 0-127 parameter indicates that CPU cores 0 to 127 are collected. The -d 3 parameter indicates that the collection duration is 3 seconds. The -L 2 parameter indicates that the Back-End Bound -> Core Bound instruction data is collected.
Command output:
================================================================================
Version   : DevKit xxx
CPU Model : xxx
Command   : devkit tuner top-down -c 0-127 -d 3 -L 2
================================================================================
TOP-DOWN Summary Report-ALL Time:2026/02/03 19:05:16
=======================================================================
Top-down metrics of CPU(s) 0-127:
Cycles        408,642,602,711
Instructions  347,968,271,194
IPC           0.85
────────────────────────────────────────────────────────────────────────────────
Top-down Metrics                        Bound(%)  Preferred Sampling Event
────────────────────────────────────────────────────────────────────────────────
Bad Speculation                           0.16    --
Frontend Bound                            2.05    --
Retiring                                 14.19    inst_retired
Backend Bound                            83.59    --
├── Core Bound                           36.21    --
│   ├── FDIV Stall                        0.00    --
│   ├── DIV Stall                         0.00    --
│   ├── FSU Stall                         1.12    --
│   ├── Resource Bound*                  13.39    --
│   │   ├── Rob_stall*                    0.48    --
│   │   ├── Ptag_stall*                   7.18    --
│   │   ├── MapQ_stall*                   5.72    --
│   │   ├── PCBuf_stall*                  0.01    --
│   │   └── Other_stall*                  0.00    --
│   └── Exe Ports Util                   21.70    --
│       ├── 0 ports serialize             0.47    --
│       ├── 0 ports non serialize        13.99    --
│       ├── 1 ports                       0.81    --
│       ├── 2 ports                       0.33    --
│       ├── 3 ports                       0.64    --
│       ├── 4 ports                       5.40    --
│       ├── 5 ports                       0.04    --
│       └── 6p ports                      0.02    --
└── Memory Bound                         47.38    --
────────────────────────────────────────────────────────────────────────────────
────────────────────────────────────────────────────────────────────────────────
PMU Event  Count
────────────────────────────────────────────────────────────────────────────────
r0008  347,968,271,194
r0011  408,642,602,711
r001b  351,964,762,921
r2004  4,757,141,798
r2005  91,300,454
r2006  9,758,187,924
r2007  18,186
r2008  62,112,994,434
r2009  0
r200a  164
r200b  4,558,286,815
r200c  35,479,858,717
r200d  17,243,622,831
r2011  50,301,733,180
r7000  98,402,039,920
r7001  387,268,814,027
r7002  1,216,970
r7003  27,917,649
r7004  8,211,072,317
r7005  219,520,201,751
r7006  1,234,618
r700a  8,177,731,287
r700b  241,370,982,829
r700c  13,916,833,884
r700d  5,613,201,615
r700e  11,099,720,734
r700f  932,53,410,476
r7010  688,907,852
r7011  345,655,294
────────────────────────────────────────────────────────────────────────────────
3083 milliseconds time elapsed
Metrics marked with '*' indicate approximate values.
Note: To view the hotspot data. You can run devkit tuner hotspot -e [Preferred Sampling Event]
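The figures in this report can be cross-checked by hand. The sketch below recomputes IPC from the Cycles and Instructions lines and checks that child rows add up to their parent row. Three points are observations from this one sample rather than documented guarantees: that r0008 and r0011 carry the same counts as Instructions and Cycles, that children sum to their parent, and that 0.02 is a suitable rounding tolerance. The helper names are hypothetical.

```python
def ipc(instructions, cycles):
    """Instructions per cycle, as shown on the IPC line of the report."""
    return instructions / cycles


def children_match_parent(parent, children, tolerance=0.02):
    """True if the child percentages add up to the parent within rounding."""
    return abs(sum(children) - parent) <= tolerance


# Counts from the report above (r0008 / Instructions, r0011 / Cycles).
print(f"IPC {ipc(347_968_271_194, 408_642_602_711):.2f}")   # IPC 0.85

# The four top-level categories account for (almost exactly) 100%.
print(children_match_parent(100.0, [0.16, 2.05, 14.19, 83.59]))   # True

# Backend Bound = Core Bound + Memory Bound.
print(children_match_parent(83.59, [36.21, 47.38]))               # True

# Core Bound = FDIV + DIV + FSU + Resource Bound + Exe Ports Util.
print(children_match_parent(36.21, [0.00, 0.00, 1.12, 13.39, 21.70]))  # True
```

Reading the tree this way shows where the 83.59% Backend Bound share goes: 47.38% is Memory Bound, and most of the remaining Core Bound share comes from Exe Ports Util and Resource Bound.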
- Collect process data.
devkit tuner top-down -p 3716829 -d 3
The -p 3716829 parameter indicates that the process with PID 3716829 is collected. The -d 3 parameter indicates that the collection duration is 3 seconds. If the -L parameter is not specified, data of all dimensions is collected.
Command output:
================================================================================
Version   : DevKit xxx
Command   : devkit tuner top-down -p 3716829 -d 3
================================================================================
TOP-DOWN Summary Report-ALL Time:2026/02/03 19:11:04
=======================================================================
Top-down metrics of process id '3716829':
Cycles        565,161,956,085
Instructions  890,812,766,868
IPC           1.58
────────────────────────────────────────────────────────────────────────────────
Top-down Metrics                        Bound(%)  Preferred Sampling Event
────────────────────────────────────────────────────────────────────────────────
Bad Speculation                           0.00    --
├── Branch Mispredicts                    0.00    br_mis_pred
│   ├── Indirect Branch                   0.00    --
│   ├── Push Branch                       0.00    --
│   ├── Pop Branch                        0.00    --
│   └── Other Branch                      0.00    --
└── Machine Clears                        0.00    --
    ├── Nuke Flush                        0.00    --
    └── Other Flush                       0.00    --
Frontend Bound                            0.87    --
├── Fetch Latency Bound                   0.74    --
│   ├── ITLB Miss                         0.06    --
│   ├── ICache Miss                       0.62    --
│   ├── BP_Misp_Flush                     0.03    br_mis_pred
│   ├── OoO Flush                         0.01    --
│   └── Static Predictor Flush            0.03    --
└── Fetch Bandwidth Bound                 0.13    --
Retiring                                 26.27    inst_retired
Backend Bound                            72.86    --
├── Core Bound                           33.15    --
│   ├── FDIV Stall                        0.00    --
│   ├── DIV Stall                         0.00    --
│   ├── FSU Stall                         0.84    --
│   ├── Resource Bound*                  11.94    --
│   │   ├── Rob_stall*                    0.15    --
│   │   ├── Ptag_stall*                   6.40    --
│   │   ├── MapQ_stall*                   5.39    --
│   │   ├── PCBuf_stall*                  0.00    --
│   │   └── Other_stall*                  0.00    --
│   └── Exe Ports Util                   20.37    --
│       ├── 0 ports serialize             0.28    --
│       ├── 0 ports non serialize        12.10    --
│       ├── 1 ports                       0.62    --
│       ├── 2 ports                       0.29    --
│       ├── 3 ports                       0.54    --
│       ├── 4 ports                       6.49    --
│       ├── 5 ports                       0.04    --
│       └── 6p ports                      0.02    --
└── Memory Bound                         39.70    --
    ├── L1 Bound                          3.51    --
    │   ├── DTLB                          0.18    --
    │   ├── Misalign                      0.53    --
    │   ├── Resource Full                 0.00    --
    │   ├── Instruction Type              0.14    --
    │   ├── Forward hazard                0.15    --
    │   ├── Structure hazard              1.77    --
    │   └── Pipeline                      0.74    --
    ├── L2 Bound                          0.00    --
    │   ├── buffer pending                0.00    --
    │   ├── snoop pending                 0.00    --
    │   ├── Arb idle                      0.00    --
    │   └── Pipeline                      0.00    --
    ├── L3 or DRAM Bound                 36.20    --
    └── Store Bound                       0.00    --
        ├── SCA                           0.00    --
        ├── Head                          0.00    --
        ├── Order                         0.00    --
        └── Other                         0.00    --
────────────────────────────────────────────────────────────────────────────────
────────────────────────────────────────────────────────────────────────────────
PMU Event  Count
────────────────────────────────────────────────────────────────────────────────
r0008  890,812,766,869
r0010  42,361,366
r0011  565,161,956,085
r001b  885,884,852,999
r0027  120,670,847
r0028  29,018,877
r002e  647,664
r0030  24,644,752
r100d  5,983,845
r1010  8,632,549
r1013  16,476
r1016  123,653
r104f  34,432,517
r2004  2,603,284,473
r2005  69,323,025
r2006  5,786,237,013
r2007  29,311
r2008  109,031,479,289
r2009  14
r200a  0
r200b  6,304,758,864
r200c  50,709,639,660
r200d  39,671,785,098
r200f  4,573,402
r2010  6,060,229
r2011  29,618,727,432
r2012  4,204,799,697
r5090  498,155,819
r5091  1,505,244,430
r5092  1,397,778
r5093  399,164,697
r5094  432,410,165
r5095  5,041,495,853
r5096  2,097,655,733
r50a0  159,236,956
r50a2  17,038,846,820
r50a3  256,623,465,481
r50a4  245,022,246
r7000  143,896,308,625
r7001  561,721,629,213
r7002  1,740,417
r7003  25,562,410
r7004  10,080,452,487
r7005  306,104,360,719
r7006  5,399,535
r7007  278,926,699,807
r7008  280,522,927,806
r700a  7,776,571,030
r700b  341,315,121,543
r700c  17,521,518,047
r700d  8,248,288,096
r700e  15,281,449,842
r700f  183,091,541,693
r7010  1,132,511,038
r7011  558,813,174
r701e  6,807,528
r701f  95,045,849
r7020  960,519,537
────────────────────────────────────────────────────────────────────────────────
3378 milliseconds time elapsed
Metrics marked with '*' indicate approximate values.
Note: To view the hotspot data. You can run devkit tuner hotspot -e [Preferred Sampling Event]

Optimization Suggestions
1. The percentage of Backend Bound is high.(Threshold: 20.00%)
Take the following optimization measures for C/C++ applications compiled using the BiSheng compiler. For other compilers, you can refer to the optimization suggestions. Verify the optimization suggestions in your specific application scenario.
(1) Use the jemalloc library. Associate the libjemalloc.so soft link in the lib directory of the BiSheng compiler with the jemalloc dynamic library entity whose size is the same as the system page table size in this directory, and add the -ljemalloc parameter for the compilation.
(2) Set Wrap-memset/memcpy: Wl,-wrap=memset/memcpy -lstringlib. The BiSheng compiler provides memset/memcpy implementation in the libstring library, which is more adaptable to the AArch64 architecture. When the glibc version is earlier and the function proportion is high, the performance is significantly improved.
(3) Set prefetch to save the data to be accessed to the cache, so as to reduce the value of d-cache miss. The hardware has its own prefetch mechanism. The compiler supports the software prefetch function. When tsv110 is enabled, the BiSheng compiler automatically enables software prefetch. You can adjust the prefetch density by using the three parameters: -mllvm -prefetch-loop-depth=x -mllvm -min-prefetch-stride=y -mllvm -prefetch-distance=z, where for example, x=3, y=9, z=940.
(4) Add the -fstack-arrays parameter to place all arrays onto the stack. The parameter takes effect only on Fortran.
(5) Try enabling huge pages.
The Preferred Sampling Event column lists the key events behind a top-down bound metric. Tuning the code that triggers these events reduces the corresponding bound percentage. Use devkit tuner hotspot -e [Preferred Sampling Event] for the analysis and tuning.
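As an illustration of how the Preferred Sampling Event column feeds the hotspot command, the following sketch scans report rows, keeps those whose event is not "--", and builds one `devkit tuner hotspot -e` command per event. The row pattern is an assumption based on the sample output above, the helper names are hypothetical, and only the -e option named in the report note is used.

```python
import re

# A metric row: optional tree-drawing prefix, metric name, percentage, event.
ROW = re.compile(
    r"^[\s│├└─]*(?P<name>.+?)\s+(?P<bound>\d+\.\d+)\s+(?P<event>\S+)\s*$"
)


def preferred_events(report_lines):
    """Map each preferred sampling event to the metrics that name it."""
    events = {}
    for line in report_lines:
        match = ROW.match(line)
        if match and match["event"] != "--":
            events.setdefault(match["event"], []).append(match["name"])
    return events


def hotspot_commands(events):
    """One `devkit tuner hotspot -e <event>` argv list per event."""
    return [["devkit", "tuner", "hotspot", "-e", event] for event in events]


sample = [
    "Cycles        565,161,956,085",
    "Bad Speculation                           0.00    --",
    "├── Branch Mispredicts                    0.00    br_mis_pred",
    "│   ├── BP_Misp_Flush                     0.03    br_mis_pred",
    "Retiring                                 26.27    inst_retired",
    "Backend Bound                            72.86    --",
]
for argv in hotspot_commands(preferred_events(sample)):
    print(" ".join(argv))
```

With the sample rows, this prints one command for br_mis_pred and one for inst_retired. Lines such as Cycles or the PMU event counts do not match the row pattern and are ignored.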
- Collect application data.
devkit tuner top-down -d 10 -L 2 /opt/testdemo/cache_miss_long
The -d 10 parameter indicates that the collection duration is 10 seconds, and the -L 2 parameter indicates that the Back-End Bound -> Core Bound instruction data is collected.
Command output:
================================================================================
Version   : DevKit xxx
Command   : devkit tuner top-down -d 10 -L 2 /opt/testdemo/cache_miss_long
================================================================================
TOP-DOWN Summary Report-ALL Time:2026/02/03 19:14:04
=======================================================================
Top-down metrics of /opt/testdemo/cache_miss_long:
Cycles        28,931,970,351
Instructions  12,298,232,508
IPC           0.43
────────────────────────────────────────────────────────────────────────────────
Top-down Metrics                        Bound(%)  Preferred Sampling Event
────────────────────────────────────────────────────────────────────────────────
Bad Speculation                           0.22    --
Frontend Bound                            1.52    --
Retiring                                  7.08    inst_retired
Backend Bound                            91.18    --
├── Core Bound                           31.48    --
│   ├── FDIV Stall                        0.00    --
│   ├── DIV Stall                         0.00    --
│   ├── FSU Stall                         0.00    --
│   ├── Resource Bound*                  20.10    --
│   │   ├── Rob_stall*                    0.04    --
│   │   ├── Ptag_stall*                  18.33    --
│   │   ├── MapQ_stall*                   1.73    --
│   │   ├── PCBuf_stall*                  0.00    --
│   │   └── Other_stall*                  0.00    --
│   └── Exe Ports Util                   11.38    --
│       ├── 0 ports serialize             0.16    --
│       ├── 0 ports non serialize         8.04    --
│       ├── 1 ports                       1.54    --
│       ├── 2 ports                       0.87    --
│       ├── 3 ports                       0.45    --
│       ├── 4 ports                       0.21    --
│       ├── 5 ports                       0.07    --
│       └── 6p ports                      0.03    --
└── Memory Bound                         59.70    --
────────────────────────────────────────────────────────────────────────────────
────────────────────────────────────────────────────────────────────────────────
PMU Event  Count
────────────────────────────────────────────────────────────────────────────────
r0008  12,298,232,508
r0011  28,931,970,351
r001b  12,675,652,675
r2004  51,385,954
r2005  5,877,420
r2006  23,237,998,292
r2007  2,054
r2008  0
r2009  0
r200a  0
r200b  2,188,478,662
r200c  0
r200d  0
r2011  2,638,598,816
r7000  17,589,179,668
r7001  28,842,703,274
r7002  0
r7003  151,662
r7004  0
r7005  18,885,819,361
r7006  0
r700a  417,655,851
r700b  20,454,124,794
r700c  3,923,354,625
r700d  2,222,293,173
r700e  1,147,598,469
r700f  523,512,233
r7010  182,140,658
r7011  87,297,422
────────────────────────────────────────────────────────────────────────────────
10003 milliseconds time elapsed
Metrics marked with '*' indicate approximate values.
Note: To view the hotspot data. You can run devkit tuner hotspot -e [Preferred Sampling Event]

Optimization Suggestions
1. The percentage of Backend Bound is high.(Threshold: 20.00%)
Take the following optimization measures for C/C++ applications compiled using the BiSheng compiler. For other compilers, you can refer to the optimization suggestions. Verify the optimization suggestions in your specific application scenario.
(1) Use the jemalloc library. Associate the libjemalloc.so soft link in the lib directory of the BiSheng compiler with the jemalloc dynamic library entity whose size is the same as the system page table size in this directory, and add the -ljemalloc parameter for the compilation.
(2) Set Wrap-memset/memcpy: Wl,-wrap=memset/memcpy -lstringlib. The BiSheng compiler provides memset/memcpy implementation in the libstring library, which is more adaptable to the AArch64 architecture. When the glibc version is earlier and the function proportion is high, the performance is significantly improved.
(3) Set prefetch to save the data to be accessed to the cache, so as to reduce the value of d-cache miss. The hardware has its own prefetch mechanism. The compiler supports the software prefetch function. When tsv110 is enabled, the BiSheng compiler automatically enables software prefetch. You can adjust the prefetch density by using the three parameters: -mllvm -prefetch-loop-depth=x -mllvm -min-prefetch-stride=y -mllvm -prefetch-distance=z, where for example, x=3, y=9, z=940.
(4) Add the -fstack-arrays parameter to place all arrays onto the stack. The parameter takes effect only on Fortran.
(5) Try enabling huge pages.
- Collect cgroup data.
devkit tuner top-down -d 10 -L 2 -G my_test_cgroup
-d 10 indicates that the collection duration is 10 seconds. -L 2 indicates that the Back-End Bound->Core Bound instruction data is collected. -G my_test_cgroup indicates that the cgroup named my_test_cgroup is collected.
Command output:
================================================================================
Version   : DevKit xxx
Command   : devkit tuner top-down -d 10 -L 2 -G my_test_cgroup
================================================================================
TOP-DOWN Summary Report-ALL Time:2026/02/03 19:20:10
=======================================================================
Top-down metrics of cgroup: my_test_cgroup:
Cycles        28,852,354,047
Instructions  29,575,950,273
IPC           1.03
────────────────────────────────────────────────────────────────────────────────
Top-down Metrics                        Bound(%)  Preferred Sampling Event
────────────────────────────────────────────────────────────────────────────────
Bad Speculation                          47.60    --
Frontend Bound                            9.24    --
Retiring                                 17.08    inst_retired
Backend Bound                            26.07    --
├── Core Bound                           18.37    --
│   ├── FDIV Stall                        0.00    --
│   ├── DIV Stall                         0.00    --
│   ├── FSU Stall                         0.00    --
│   ├── Resource Bound*                   0.97    --
│   │   ├── Rob_stall*                    0.00    --
│   │   ├── Ptag_stall*                   0.90    --
│   │   ├── MapQ_stall*                   0.08    --
│   │   ├── PCBuf_stall*                  0.00    --
│   │   └── Other_stall*                  0.00    --
│   └── Exe Ports Util                   17.40    --
│       ├── 0 ports serialize             0.02    --
│       ├── 0 ports non serialize         4.40    --
│       ├── 1 ports                       0.15    --
│       ├── 2 ports                       0.81    --
│       ├── 3 ports                       2.24    --
│       ├── 4 ports                       3.56    --
│       ├── 5 ports                       3.47    --
│       └── 6p ports                      2.74    --
└── Memory Bound                          7.70    --
────────────────────────────────────────────────────────────────────────────────
────────────────────────────────────────────────────────────────────────────────
PMU Event  Count
────────────────────────────────────────────────────────────────────────────────
r0008  29,575,950,273
r0011  28,852,354,047
r001b  111,983,642,709
r2004  4,947,993
r2005  297,829
r2006  3,396,949,558
r2007  291
r2008  0
r2009  0
r200a  0
r200b  284,335,306
r200c  0
r200d  7,780
r2011  16,002,028,769
r7000  958,854,053
r7001  24,280,047,948
r7002  158
r7003  153,511
r7004  1,838,145
r7005  7,167,944,269
r7006  0
r700a  31,687,641
r700b  7,243,637,081
r700c  246,613,685
r700d  1,332,389,662
r700e  3,690,316,653
r700f  5,861,896,332
r7010  5,718,837,368
r7011  4,515,603,055
────────────────────────────────────────────────────────────────────────────────
10118 milliseconds time elapsed
Metrics marked with '*' indicate approximate values.
Note: To view the hotspot data. You can run devkit tuner hotspot -e [Preferred Sampling Event]
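To compare the four example reports, it can help to reduce each one to its largest non-Retiring top-level category. The sketch below does that with the percentages copied from the outputs above. Treating Retiring as useful work rather than a loss follows the general top-down methodology and is an assumption here, and the helper name `dominant_loss` is hypothetical.

```python
# Top-level Bound(%) values copied from the four example reports.
RUNS = {
    "CPU(s) 0-127": {
        "Bad Speculation": 0.16, "Frontend Bound": 2.05,
        "Retiring": 14.19, "Backend Bound": 83.59,
    },
    "process id 3716829": {
        "Bad Speculation": 0.00, "Frontend Bound": 0.87,
        "Retiring": 26.27, "Backend Bound": 72.86,
    },
    "/opt/testdemo/cache_miss_long": {
        "Bad Speculation": 0.22, "Frontend Bound": 1.52,
        "Retiring": 7.08, "Backend Bound": 91.18,
    },
    "cgroup my_test_cgroup": {
        "Bad Speculation": 47.60, "Frontend Bound": 9.24,
        "Retiring": 17.08, "Backend Bound": 26.07,
    },
}


def dominant_loss(metrics):
    """Name of the largest top-level category, ignoring Retiring."""
    losses = {name: value for name, value in metrics.items()
              if name != "Retiring"}
    return max(losses, key=losses.get)


for target, metrics in RUNS.items():
    print(f"{target}: {dominant_loss(metrics)}")
```

The first three runs are dominated by Backend Bound, while the cgroup run is dominated by Bad Speculation, so the next analysis step differs between them.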