Microarchitecture Analysis

Arm performance monitor unit (PMU) events show how instructions move through the CPU pipeline. Use this information to adjust your application so that it makes full use of the hardware resources.

Table 1 provides the tutorial.

Table 1 Tutorial

Category          Scenario                      Link
Best Practices    Microarchitecture analysis    Practice 1: Microarchitecture Analysis

Command Function

Analyzes how instructions run on the CPU pipeline based on Arm PMU events, helping you quickly locate the performance bottlenecks of the current application on the CPUs.

Syntax

devkit tuner top-down [-h] [-c {n | n,m | n-m}] [-d <sec>] [-D <sec>] [-l {0, 1, 2, 3}] [-L {0, 1, 2, 3, 4, 5, 6}] [-i <sec>] [-p {PID | PID1,PID2 | ALL}] [-r {user, kernel, all}] [-G cgroup_name] [workload workload...]

To collect data for a specific application, replace [workload workload...] in the command with the application path and its arguments. Only one of -c, -p, -G, and the application (workload) can be specified.
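
The mutual-exclusion rule above can be checked before the tool is launched. The following is a minimal Python sketch, not part of DevKit: the helper name build_top_down_command and its keyword arguments are hypothetical, and it emits only flags that appear in the syntax line. It rejects more than one collection target; the behavior when no target is given is not described here, so the sketch does not enforce it.

```python
from typing import List, Optional, Sequence


def build_top_down_command(
    cpu: Optional[str] = None,
    pid: Optional[str] = None,
    cgroup: Optional[str] = None,
    workload: Optional[Sequence[str]] = None,
    duration: Optional[int] = None,
    profile_level: Optional[int] = None,
) -> List[str]:
    """Assemble a `devkit tuner top-down` argument list (hypothetical helper)."""
    flag_targets = (("-c", cpu), ("-p", pid), ("-G", cgroup))
    chosen = [flag for flag, value in flag_targets if value is not None]
    if workload:
        chosen.append("workload")
    # Only one of -c, -p, -G, and the workload can be specified.
    if len(chosen) > 1:
        raise ValueError("conflicting collection targets: " + ", ".join(chosen))

    args = ["devkit", "tuner", "top-down"]
    for flag, value in flag_targets:
        if value is not None:
            args += [flag, str(value)]
    if duration is not None:
        args += ["-d", str(duration)]
    if profile_level is not None:
        args += ["-L", str(profile_level)]
    if workload:
        args += list(workload)  # application path followed by its arguments
    return args


print(" ".join(build_top_down_command(cpu="0-127", duration=3, profile_level=2)))
# prints: devkit tuner top-down -c 0-127 -d 3 -L 2
```

The returned list can be passed to subprocess.run on a machine where DevKit is installed. Passing, for example, both cpu and pid raises ValueError before any command is started.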

Parameter Description

Table 2 Parameter description

Parameter

Option

Description

-h/--help

-

Obtains help information. This parameter is optional.

-c/--cpu

-

IDs of the CPU cores to collect data from, for example, "0", "0,1,2", or "0-2". This parameter is optional.

-d/--duration

-

Collection duration, in seconds. The minimum value is 1. By default, collection continues until you stop it: press Ctrl+\ to cancel the task, or press Ctrl+C to stop the collection and start the analysis. This parameter is optional.

-D/--delay

-

Collection delay, in seconds. The default value is 0. The delay must be less than the collection duration. This parameter is optional.

-l/--log-level

0/1/2/3

Log level, which defaults to 1. This parameter is optional.
  • 0: DEBUG
  • 1: INFO
  • 2: WARNING
  • 3: ERROR

-L/--profile-level

0/1/2/3/4/5/6

Metrics to collect and analyze. The default value is 0. This parameter is optional.

  • 0: Collects data of all dimensions and generates a result.
  • 1: Collects the top-level metrics: Back-End Bound, Bad Speculation, Front-End Bound, and Retiring.
  • 2: Collects Back-End Bound->Core Bound. The back end is the part of the processor that dispatches and executes micro-ops (uOps) out of order and returns the results. Core Bound is a subclass of Back-End Bound. It reflects the proportion of performance bottlenecks caused by insufficient CPU execution unit resources.
  • 3: Collects Back-End Bound->Memory Bound. Memory Bound is a subclass of Back-End Bound. It reflects pipeline stalls caused by waiting for data reads and writes.
  • 4: Collects Back-End Bound->Resource Bound (Kunpeng 920). Resource Bound is a subclass of Back-End Bound. It reflects pipeline stalls that occur when uOps cannot be dispatched to the out-of-order execution scheduler because resources are insufficient.
  • 5: Collects Bad Speculation, which reflects pipeline resources wasted on incorrect instruction speculation.
  • 6: Collects Front-End Bound. The front end is the part of the processor that fetches instructions and decodes them into uOps for execution by the back-end pipeline. This metric reflects the proportion of front-end resources that are under-utilized.

-i/--interval

-

Collection interval, in seconds. The default value is 1. If a collection duration is set, the interval must be less than or equal to that duration. This parameter is optional.

-p/--pid

-

ID of the process to collect data from. Separate multiple PIDs with commas (,). The value ALL is also accepted, as shown in the syntax. This parameter is optional.

-r/--collection-range

user/kernel/all

Collection range. The default value is all, which collects both user-mode and kernel-mode performance data. When -p/--pid is set to ALL, you can select user or kernel to collect only user-mode processes or only kernel-mode processes. This parameter is optional.

  • user: collects user-mode performance data.
  • kernel: collects kernel-mode performance data.
  • all: collects user-mode and kernel-mode performance data.

-G/--cgroup

-

Monitors the specified process group and manages its resources. Only cgroup v1 and cgroup v2 are supported.
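
The timing rules in Table 2 (-d at least 1 second, -D less than -d, and -i no greater than -d when a duration is set) can be checked up front. The following is a minimal Python sketch under those stated rules only; the function name check_timing is hypothetical and is not part of DevKit. Whether the tool accepts fractional seconds is not stated in the table, so the sketch simply compares numbers.

```python
from typing import List, Optional


def check_timing(
    duration: Optional[float] = None,  # -d; None means collection runs until stopped
    delay: float = 0,                  # -D; documented default is 0
    interval: float = 1,               # -i; documented default is 1
) -> List[str]:
    """Return the violated timing rules; an empty list means the values are consistent."""
    problems = []
    if duration is not None:
        if duration < 1:
            problems.append("-d must be at least 1 second")
        if delay >= duration:
            problems.append("-D must be less than -d")
        if interval > duration:
            problems.append("-i must be less than or equal to -d")
    return problems


print(check_timing(duration=3, delay=0, interval=1))  # prints: []
print(check_timing(duration=2, interval=5))
# prints: ['-i must be less than or equal to -d']
```

When no duration is given, the sketch reports nothing, because the delay and interval limits in the table are both stated relative to the configured duration.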

Example

  • Collect CPU data.
    devkit tuner top-down -c 0-127 -d 3 -L 2
    

    The -c 0-127 parameter indicates that CPU cores 0 to 127 are collected. The -d 3 parameter indicates that the collection duration is 3 seconds. The -L 2 parameter indicates that the Back-End Bound -> Core Bound instruction data is collected.

    Command output:

    ================================================================================
    Version     : DevKit xxx
    CPU Model   : xxx
    Command     : devkit tuner top-down -c 0-127 -d 3 -L 2
    ================================================================================
    
    TOP-DOWN Summary Report-ALL                    Time:2026/02/03 19:05:16
    =======================================================================
    
    Top-down metrics of CPU(s) 0-127:
    Cycles              408,642,602,711
    Instructions        347,968,271,194
    IPC                 0.85
    
    ────────────────────────────────────────────────────────────────────────────────
      Top-down Metrics                        Bound(%)    Preferred Sampling Event
    ────────────────────────────────────────────────────────────────────────────────
      Bad Speculation                             0.16    --
    
      Frontend Bound                              2.05    --
    
      Retiring                                   14.19    inst_retired
    
      Backend Bound                              83.59    --
      ├── Core Bound                             36.21    --
      │   ├── FDIV Stall                          0.00    --
      │   ├── DIV Stall                           0.00    --
      │   ├── FSU Stall                           1.12    --
      │   ├── Resource Bound*                    13.39    --
      │   │   ├── Rob_stall*                      0.48    --
      │   │   ├── Ptag_stall*                     7.18    --
      │   │   ├── MapQ_stall*                     5.72    --
      │   │   ├── PCBuf_stall*                    0.01    --
      │   │   └── Other_stall*                    0.00    --
      │   └── Exe Ports Util                     21.70    --
      │       ├── 0 ports serialize               0.47    --
      │       ├── 0 ports non serialize          13.99    --
      │       ├── 1 ports                         0.81    --
      │       ├── 2 ports                         0.33    --
      │       ├── 3 ports                         0.64    --
      │       ├── 4 ports                         5.40    --
      │       ├── 5 ports                         0.04    --
      │       └── 6p ports                        0.02    --
      └── Memory Bound                           47.38    --
    ────────────────────────────────────────────────────────────────────────────────
    
    ────────────────────────────────────────────────────────────────────────────────
      PMU Event                                  Count
    ────────────────────────────────────────────────────────────────────────────────
      r0008                               347,968,271,194
      r0011                               408,642,602,711
      r001b                               351,964,762,921
      r2004                                 4,757,141,798
      r2005                                   91,300,454
      r2006                                 9,758,187,924
      r2007                                      18,186
      r2008                                62,112,994,434
      r2009                                          0
      r200a                                        164
      r200b                                 4,558,286,815
      r200c                                35,479,858,717
      r200d                                17,243,622,831
      r2011                                50,301,733,180
      r7000                                98,402,039,920
      r7001                               387,268,814,027
      r7002                                    1,216,970
      r7003                                   27,917,649
      r7004                                 8,211,072,317
      r7005                               219,520,201,751
      r7006                                    1,234,618
      r700a                                 8,177,731,287
      r700b                               241,370,982,829
      r700c                                13,916,833,884
      r700d                                 5,613,201,615
      r700e                                11,099,720,734
      r700f                                93,253,410,476
      r7010                                  688,907,852
      r7011                                  345,655,294
    ────────────────────────────────────────────────────────────────────────────────
    3083 milliseconds time elapsed
    
    Metrics marked with '*' indicate approximate values.
    
    Note: To view the hotspot data. You can run devkit tuner hotspot -e [Preferred Sampling Event]
  • Collect process data.
    devkit tuner top-down -p 3716829 -d 3
    

    The -p 3716829 parameter indicates that the process with PID 3716829 is collected. The -d 3 parameter indicates that the collection duration is 3 seconds. If the -L parameter is not specified, data of all dimensions is collected.

    Command output:

    ================================================================================
    Version     : DevKit xxx
    Command     : devkit tuner top-down -p 3716829 -d 3
    ================================================================================
    
    TOP-DOWN Summary Report-ALL                    Time:2026/02/03 19:11:04
    =======================================================================
    
    Top-down metrics of process id '3716829':
    Cycles              565,161,956,085
    Instructions        890,812,766,868
    IPC                 1.58
    
    ────────────────────────────────────────────────────────────────────────────────
      Top-down Metrics                        Bound(%)    Preferred Sampling Event
    ────────────────────────────────────────────────────────────────────────────────
      Bad Speculation                             0.00    --
      ├── Branch Mispredicts                      0.00    br_mis_pred
      │   ├── Indirect Branch                     0.00    --
      │   ├── Push Branch                         0.00    --
      │   ├── Pop Branch                          0.00    --
      │   └── Other Branch                        0.00    --
      └── Machine Clears                          0.00    --
          ├── Nuke Flush                          0.00    --
          └── Other Flush                         0.00    --
    
      Frontend Bound                              0.87    --
      ├── Fetch Latency Bound                     0.74    --
      │   ├── ITLB Miss                           0.06    --
      │   ├── ICache Miss                         0.62    --
      │   ├── BP_Misp_Flush                       0.03    br_mis_pred
      │   ├── OoO Flush                           0.01    --
      │   └── Static Predictor Flush              0.03    --
      └── Fetch Bandwidth Bound                   0.13    --
    
      Retiring                                   26.27    inst_retired
    
      Backend Bound                              72.86    --
      ├── Core Bound                             33.15    --
      │   ├── FDIV Stall                          0.00    --
      │   ├── DIV Stall                           0.00    --
      │   ├── FSU Stall                           0.84    --
      │   ├── Resource Bound*                    11.94    --
      │   │   ├── Rob_stall*                      0.15    --
      │   │   ├── Ptag_stall*                     6.40    --
      │   │   ├── MapQ_stall*                     5.39    --
      │   │   ├── PCBuf_stall*                    0.00    --
      │   │   └── Other_stall*                    0.00    --
      │   └── Exe Ports Util                     20.37    --
      │       ├── 0 ports serialize               0.28    --
      │       ├── 0 ports non serialize          12.10    --
      │       ├── 1 ports                         0.62    --
      │       ├── 2 ports                         0.29    --
      │       ├── 3 ports                         0.54    --
      │       ├── 4 ports                         6.49    --
      │       ├── 5 ports                         0.04    --
      │       └── 6p ports                        0.02    --
      └── Memory Bound                           39.70    --
          ├── L1 Bound                            3.51    --
          │   ├── DTLB                            0.18    --
          │   ├── Misalign                        0.53    --
          │   ├── Resource Full                   0.00    --
          │   ├── Instruction Type                0.14    --
          │   ├── Forward hazard                  0.15    --
          │   ├── Structure hazard                1.77    --
          │   └── Pipeline                        0.74    --
          ├── L2 Bound                            0.00    --
          │   ├── buffer pending                  0.00    --
          │   ├── snoop pending                   0.00    --
          │   ├── Arb idle                        0.00    --
          │   └── Pipeline                        0.00    --
          ├── L3 or DRAM Bound                   36.20    --
          └── Store Bound                         0.00    --
              ├── SCA                             0.00    --
              ├── Head                            0.00    --
              ├── Order                           0.00    --
              └── Other                           0.00    --
    ────────────────────────────────────────────────────────────────────────────────
    
    ────────────────────────────────────────────────────────────────────────────────
      PMU Event                                  Count
    ────────────────────────────────────────────────────────────────────────────────
      r0008                               890,812,766,869
      r0010                                   42,361,366
      r0011                               565,161,956,085
      r001b                               885,884,852,999
      r0027                                  120,670,847
      r0028                                   29,018,877
      r002e                                     647,664
      r0030                                   24,644,752
      r100d                                    5,983,845
      r1010                                    8,632,549
      r1013                                      16,476
      r1016                                     123,653
      r104f                                   34,432,517
      r2004                                 2,603,284,473
      r2005                                   69,323,025
      r2006                                 5,786,237,013
      r2007                                      29,311
      r2008                               109,031,479,289
      r2009                                         14
      r200a                                          0
      r200b                                 6,304,758,864
      r200c                                50,709,639,660
      r200d                                39,671,785,098
      r200f                                    4,573,402
      r2010                                    6,060,229
      r2011                                29,618,727,432
      r2012                                 4,204,799,697
      r5090                                  498,155,819
      r5091                                 1,505,244,430
      r5092                                    1,397,778
      r5093                                  399,164,697
      r5094                                  432,410,165
      r5095                                 5,041,495,853
      r5096                                 2,097,655,733
      r50a0                                  159,236,956
      r50a2                                17,038,846,820
      r50a3                               256,623,465,481
      r50a4                                  245,022,246
      r7000                               143,896,308,625
      r7001                               561,721,629,213
      r7002                                    1,740,417
      r7003                                   25,562,410
      r7004                                10,080,452,487
      r7005                               306,104,360,719
      r7006                                    5,399,535
      r7007                               278,926,699,807
      r7008                               280,522,927,806
      r700a                                 7,776,571,030
      r700b                               341,315,121,543
      r700c                                17,521,518,047
      r700d                                 8,248,288,096
      r700e                                15,281,449,842
      r700f                               183,091,541,693
      r7010                                 1,132,511,038
      r7011                                  558,813,174
      r701e                                    6,807,528
      r701f                                   95,045,849
      r7020                                  960,519,537
    ────────────────────────────────────────────────────────────────────────────────
    3378 milliseconds time elapsed
    
    Metrics marked with '*' indicate approximate values.
    
    Note: To view the hotspot data. You can run devkit tuner hotspot -e [Preferred Sampling Event]
    
    Optimization Suggestions
    
        1. The percentage of Backend Bound is high.(Threshold: 20.00%)
           Take the following optimization measures for C/C++ applications compiled using the BiSheng compiler. For other compilers, you can refer to the optimization
           suggestions. Verify the optimization suggestions in your specific application scenario.
           (1) Use the jemalloc library. Associate the libjemalloc.so soft link in the lib directory of the BiSheng compiler with the jemalloc dynamic library entity whose
           size is the same as the system page table size in this directory, and add the -ljemalloc parameter for the compilation.
           (2) Set Wrap-memset/memcpy: Wl,-wrap=memset/memcpy -lstringlib. The BiSheng compiler provides memset/memcpy implementation in the libstring library, which is more
           adaptable to the AArch64 architecture. When the glibc version is earlier and the function proportion is high, the performance is significantly improved.
           (3) Set prefetch to save the data to be accessed to the cache, so as to reduce the value of d-cache miss. The hardware has its own prefetch mechanism. The compiler
           supports the software prefetch function. When tsv110 is enabled, the BiSheng compiler automatically enables software prefetch. You can adjust the prefetch density
           by using the three parameters: -mllvm -prefetch-loop-depth=x -mllvm -min-prefetch-stride=y -mllvm -prefetch-distance=z, where for example, x=3, y=9, z=940.
           (4) Add the -fstack-arrays parameter to place all arrays onto the stack. The parameter takes effect only on Fortran.
           (5) Try enabling huge pages.

    The Preferred Sampling Event column lists the key events that affect the corresponding microarchitecture bound metric. To reduce a bound, analyze and tune these events by running devkit tuner hotspot -e [Preferred Sampling Event].

  • Collect application data.
    devkit tuner top-down -d 10 -L 2 /opt/testdemo/cache_miss_long
    

    The -d 10 parameter indicates that the collection duration is 10 seconds, and the -L 2 parameter indicates that the Back-End Bound -> Core Bound instruction data is collected.

    Command output:

    ================================================================================
    Version     : DevKit xxx
    Command     : devkit tuner top-down -d 10 -L 2 /opt/testdemo/cache_miss_long
    ================================================================================
    
    TOP-DOWN Summary Report-ALL                    Time:2026/02/03 19:14:04
    =======================================================================
    
    Top-down metrics of /opt/testdemo/cache_miss_long:
    Cycles              28,931,970,351
    Instructions        12,298,232,508
    IPC                 0.43
    
    ────────────────────────────────────────────────────────────────────────────────
      Top-down Metrics                        Bound(%)    Preferred Sampling Event
    ────────────────────────────────────────────────────────────────────────────────
      Bad Speculation                             0.22    --
    
      Frontend Bound                              1.52    --
    
      Retiring                                    7.08    inst_retired
    
      Backend Bound                              91.18    --
      ├── Core Bound                             31.48    --
      │   ├── FDIV Stall                          0.00    --
      │   ├── DIV Stall                           0.00    --
      │   ├── FSU Stall                           0.00    --
      │   ├── Resource Bound*                    20.10    --
      │   │   ├── Rob_stall*                      0.04    --
      │   │   ├── Ptag_stall*                    18.33    --
      │   │   ├── MapQ_stall*                     1.73    --
      │   │   ├── PCBuf_stall*                    0.00    --
      │   │   └── Other_stall*                    0.00    --
      │   └── Exe Ports Util                     11.38    --
      │       ├── 0 ports serialize               0.16    --
      │       ├── 0 ports non serialize           8.04    --
      │       ├── 1 ports                         1.54    --
      │       ├── 2 ports                         0.87    --
      │       ├── 3 ports                         0.45    --
      │       ├── 4 ports                         0.21    --
      │       ├── 5 ports                         0.07    --
      │       └── 6p ports                        0.03    --
      └── Memory Bound                           59.70    --
    ────────────────────────────────────────────────────────────────────────────────
    
    ────────────────────────────────────────────────────────────────────────────────
      PMU Event                                  Count
    ────────────────────────────────────────────────────────────────────────────────
      r0008                                12,298,232,508
      r0011                                28,931,970,351
      r001b                                12,675,652,675
      r2004                                   51,385,954
      r2005                                    5,877,420
      r2006                                23,237,998,292
      r2007                                       2,054
      r2008                                          0
      r2009                                          0
      r200a                                          0
      r200b                                 2,188,478,662
      r200c                                          0
      r200d                                          0
      r2011                                 2,638,598,816
      r7000                                17,589,179,668
      r7001                                28,842,703,274
      r7002                                          0
      r7003                                     151,662
      r7004                                          0
      r7005                                18,885,819,361
      r7006                                          0
      r700a                                  417,655,851
      r700b                                20,454,124,794
      r700c                                 3,923,354,625
      r700d                                 2,222,293,173
      r700e                                 1,147,598,469
      r700f                                  523,512,233
      r7010                                  182,140,658
      r7011                                   87,297,422
    ────────────────────────────────────────────────────────────────────────────────
    10003 milliseconds time elapsed
    
    Metrics marked with '*' indicate approximate values.
    
    Note: To view the hotspot data. You can run devkit tuner hotspot -e [Preferred Sampling Event]
    
    Optimization Suggestions
    
        1. The percentage of Backend Bound is high.(Threshold: 20.00%)
           Take the following optimization measures for C/C++ applications compiled using the BiSheng compiler. For other compilers, you can refer to the optimization
           suggestions. Verify the optimization suggestions in your specific application scenario.
           (1) Use the jemalloc library. Associate the libjemalloc.so soft link in the lib directory of the BiSheng compiler with the jemalloc dynamic library entity whose
           size is the same as the system page table size in this directory, and add the -ljemalloc parameter for the compilation.
           (2) Set Wrap-memset/memcpy: Wl,-wrap=memset/memcpy -lstringlib. The BiSheng compiler provides memset/memcpy implementation in the libstring library, which is more
           adaptable to the AArch64 architecture. When the glibc version is earlier and the function proportion is high, the performance is significantly improved.
           (3) Set prefetch to save the data to be accessed to the cache, so as to reduce the value of d-cache miss. The hardware has its own prefetch mechanism. The compiler
           supports the software prefetch function. When tsv110 is enabled, the BiSheng compiler automatically enables software prefetch. You can adjust the prefetch density
           by using the three parameters: -mllvm -prefetch-loop-depth=x -mllvm -min-prefetch-stride=y -mllvm -prefetch-distance=z, where for example, x=3, y=9, z=940.
           (4) Add the -fstack-arrays parameter to place all arrays onto the stack. The parameter takes effect only on Fortran.
           (5) Try enabling huge pages.
  • Collect cgroup data.
    devkit tuner top-down -d 10 -L 2 -G my_test_cgroup
    

    -d 10 indicates that the collection duration is 10 seconds. -L 2 indicates that the Back-End Bound->Core Bound instruction data is collected. -G my_test_cgroup indicates that the cgroup named my_test_cgroup is collected.

    Command output:

    ================================================================================
    Version     : DevKit xxx
    Command     : devkit tuner top-down -d 10 -L 2 -G my_test_cgroup
    ================================================================================
    
    TOP-DOWN Summary Report-ALL                    Time:2026/02/03 19:20:10
    =======================================================================
    
    Top-down metrics of cgroup: my_test_cgroup:
    Cycles              28,852,354,047
    Instructions        29,575,950,273
    IPC                 1.03
    
    ────────────────────────────────────────────────────────────────────────────────
      Top-down Metrics                        Bound(%)    Preferred Sampling Event
    ────────────────────────────────────────────────────────────────────────────────
      Bad Speculation                            47.60    --
    
      Frontend Bound                              9.24    --
    
      Retiring                                   17.08    inst_retired
    
      Backend Bound                              26.07    --
      ├── Core Bound                             18.37    --
      │   ├── FDIV Stall                          0.00    --
      │   ├── DIV Stall                           0.00    --
      │   ├── FSU Stall                           0.00    --
      │   ├── Resource Bound*                     0.97    --
      │   │   ├── Rob_stall*                      0.00    --
      │   │   ├── Ptag_stall*                     0.90    --
      │   │   ├── MapQ_stall*                     0.08    --
      │   │   ├── PCBuf_stall*                    0.00    --
      │   │   └── Other_stall*                    0.00    --
      │   └── Exe Ports Util                     17.40    --
      │       ├── 0 ports serialize               0.02    --
      │       ├── 0 ports non serialize           4.40    --
      │       ├── 1 ports                         0.15    --
      │       ├── 2 ports                         0.81    --
      │       ├── 3 ports                         2.24    --
      │       ├── 4 ports                         3.56    --
      │       ├── 5 ports                         3.47    --
      │       └── 6p ports                        2.74    --
      └── Memory Bound                            7.70    --
    ────────────────────────────────────────────────────────────────────────────────
    
    ────────────────────────────────────────────────────────────────────────────────
      PMU Event                                  Count
    ────────────────────────────────────────────────────────────────────────────────
      r0008                                29,575,950,273
      r0011                                28,852,354,047
      r001b                               111,983,642,709
      r2004                                    4,947,993
      r2005                                     297,829
      r2006                                 3,396,949,558
      r2007                                        291
      r2008                                          0
      r2009                                          0
      r200a                                          0
      r200b                                  284,335,306
      r200c                                          0
      r200d                                       7,780
      r2011                                16,002,028,769
      r7000                                  958,854,053
      r7001                                24,280,047,948
      r7002                                        158
      r7003                                     153,511
      r7004                                    1,838,145
      r7005                                 7,167,944,269
      r7006                                          0
      r700a                                   31,687,641
      r700b                                 7,243,637,081
      r700c                                  246,613,685
      r700d                                 1,332,389,662
      r700e                                 3,690,316,653
      r700f                                 5,861,896,332
      r7010                                 5,718,837,368
      r7011                                 4,515,603,055
    ────────────────────────────────────────────────────────────────────────────────
    10118 milliseconds time elapsed
    
    Metrics marked with '*' indicate approximate values.
    
    Note: To view the hotspot data. You can run devkit tuner hotspot -e [Preferred Sampling Event]
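
The summary reports above can also be post-processed by a script. The following is a minimal Python sketch that reads the Cycles and Instructions counters from a saved report, recomputes IPC as Instructions divided by Cycles, and picks the largest of the four top-level metrics. The function names and the SAMPLE_REPORT excerpt (copied from the first example) are illustrative assumptions; the report layout is taken from the examples in this section and may differ between DevKit versions.

```python
import re
from typing import Dict

# Excerpt copied from the first example report in this section.
SAMPLE_REPORT = """\
Cycles              408,642,602,711
Instructions        347,968,271,194
IPC                 0.85

  Bad Speculation                             0.16    --

  Frontend Bound                              2.05    --

  Retiring                                   14.19    inst_retired

  Backend Bound                              83.59    --
  ├── Core Bound                             36.21    --
  └── Memory Bound                           47.38    --
"""

TOP_LEVEL = ("Bad Speculation", "Frontend Bound", "Retiring", "Backend Bound")


def parse_counter(report: str, name: str) -> int:
    """Read a comma-grouped counter line such as 'Cycles   408,642,602,711'."""
    match = re.search(rf"^\s*{re.escape(name)}\s+([\d,]+)\s*$", report, re.MULTILINE)
    if match is None:
        raise ValueError(f"{name} not found in report")
    return int(match.group(1).replace(",", ""))


def parse_top_level(report: str) -> Dict[str, float]:
    """Extract the four top-level percentages; tree rows start with a branch glyph and are skipped."""
    metrics = {}
    for name in TOP_LEVEL:
        match = re.search(rf"^\s*{re.escape(name)}\s+(\d+\.\d+)\s", report, re.MULTILINE)
        if match is None:
            raise ValueError(f"{name} not found in report")
        metrics[name] = float(match.group(1))
    return metrics


cycles = parse_counter(SAMPLE_REPORT, "Cycles")
instructions = parse_counter(SAMPLE_REPORT, "Instructions")
metrics = parse_top_level(SAMPLE_REPORT)

print(f"IPC = {instructions / cycles:.2f}")                 # prints: IPC = 0.85
print("Dominant category:", max(metrics, key=metrics.get))  # prints: Dominant category: Backend Bound
```

In the examples above, the four top-level percentages add up to about 100% (99.99% in the first report), so a sum far from 100 suggests that the report was parsed incorrectly.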