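Microarchitecture Analysis

Based on ARM PMU (Performance Monitor Unit) events, microarchitecture analysis shows how instructions run through the CPU pipeline, so that you can make targeted changes to your program and make full use of the available hardware resources.

For an example tutorial, see Table 1.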

Table 1 Example tutorials

Category          Scenario                      Link
Best practices    Microarchitecture analysis    Practice 1: Microarchitecture Analysis

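Command Function

Based on ARM PMU events, this command shows how instructions run through the CPU pipeline and quickly locates the performance bottlenecks of the current application on the CPU.

Command Format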

devkit tuner top-down [-h] [-c {n | n,m | n-m}] [-d <sec>] [-D <sec>] [-l {0, 1, 2, 3}] [-L {0, 1, 2, 3, 4, 5, 6}] [-i <sec>] [-p {PID | PID1,PID2 | ALL}] [-r {user, kernel, all}] [-G cgroup_name] [workload workload...]

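devkit tuner top-down [workload workload...] collects data for a specified application; replace [workload workload...] with the application path and its arguments. For a single collection, specify at most one of "-c", "-p", "-G", or a workload.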
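
The one-target rule can be checked before a collection is launched. The following is a minimal Python sketch and not part of DevKit: the helper name `build_top_down_cmd` and its keyword arguments are hypothetical, and it only emits flags that appear in the command format above.

```python
from typing import List, Optional, Sequence


def build_top_down_cmd(
    cpu: Optional[str] = None,
    pid: Optional[str] = None,
    cgroup: Optional[str] = None,
    workload: Sequence[str] = (),
    duration: Optional[int] = None,
    profile_level: Optional[int] = None,
) -> List[str]:
    """Build an argv list for `devkit tuner top-down` (hypothetical helper)."""
    targets = {"-c": cpu, "-p": pid, "-G": cgroup}
    chosen = [flag for flag, value in targets.items() if value is not None]
    if workload:
        chosen.append("workload")
    # The documentation allows at most one of -c, -p, -G, or a workload.
    if len(chosen) > 1:
        raise ValueError("only one collection target is allowed, got: " + ", ".join(chosen))

    argv = ["devkit", "tuner", "top-down"]
    for flag, value in targets.items():
        if value is not None:
            argv += [flag, value]
    if duration is not None:
        argv += ["-d", str(duration)]
    if profile_level is not None:
        argv += ["-L", str(profile_level)]
    # The application path and its arguments go last.
    argv += list(workload)
    return argv
```

For example, `build_top_down_cmd(cpu="0-127", duration=3, profile_level=2)` yields the argument list for `devkit tuner top-down -c 0-127 -d 3 -L 2`, while passing both `cpu` and `pid` raises `ValueError`. The list can be handed to `subprocess.run` on a machine where DevKit is installed.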

Parameter Description

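Table 2 Parameter description

Parameter | Options | Description

-h/--help | - | Optional. Displays help information.

-c/--cpu | - | Optional. Specifies the CPU cores to collect from, for example "0", "0,1,2", or "0-2".

-d/--duration | - | Optional. Sets the collection duration in seconds. The minimum value is 1. By default, collection runs until it is stopped: press Ctrl+\ to cancel the task, or press Ctrl+C to stop collection and proceed to analysis.

-D/--delay | - | Optional. Sets the delay before collection starts, in seconds. The default is 0. The value must be less than the collection duration.

-l/--log-level | 0/1/2/3 | Optional. Sets the log level. The default is 1.
  • 0: DEBUG
  • 1: INFO
  • 2: WARNING
  • 3: ERROR

-L/--profile-level | 0/1/2/3/4/5/6 | Optional. Sets the metrics to analyze. The default is 0.
  • 0: All. Collects data for all dimensions and outputs the corresponding results.
  • 1: Level 1. Collects Back-End Bound, Bad Speculation, Front-End Bound, and Retiring.
  • 2: Back-End Bound->Core Bound. The back end is the part of the processor that dispatches and executes micro-operations out of order and returns the final results. Core Bound is a subcategory of Back-End Bound; it reflects the proportion of performance bottlenecks caused by insufficient execution unit resources.
  • 3: Back-End Bound->Memory Bound. Memory Bound is a subcategory of Back-End Bound; it reflects pipeline stalls caused by waiting for data reads and writes.
  • 4: Back-End Bound->Resource Bound (supported on Kunpeng 920 series processors). Resource Bound is a subcategory of Back-End Bound; it reflects pipeline stalls caused by a lack of resources for dispatching micro-operations to the out-of-order execution scheduler.
  • 5: Bad Speculation. Reflects pipeline resources wasted because of incorrect instruction speculation.
  • 6: Front-End Bound. The front end is the part of the processor where the instruction fetch unit fetches instructions and converts them into micro-operations for the back-end pipeline to execute. This metric reflects the proportion of time during which the front end is not fully utilized.

-i/--interval | - | Optional. Sets the collection interval in seconds. The default is 1. If a collection duration is set, the interval must be less than or equal to it.

-p/--pid | - | Optional. Specifies the PIDs of the processes to collect from. Separate multiple PIDs with commas (,).

-r/--collection-range | user/kernel/all | Optional. Sets the collection range. When "-p"/"--pid" is set to ALL, kernel-mode or user-mode processes can be collected. The default is all.
  • user: collects user-mode performance data.
  • kernel: collects kernel-mode performance data.
  • all: collects both user-mode and kernel-mode performance data.

-G/--cgroup | - | Optional. Monitors and manages resource control for the specified process group (cgroup). Currently, only cgroup v1 and cgroup v2 are supported.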
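
The value rules in Table 2 can be expressed as simple pre-flight checks. The sketch below is an assumption-laden illustration, not DevKit code: the function names `parse_cpu_list` and `check_timing` are hypothetical, and the parser accepts only the three "-c" forms given in the table ("0", "0,1,2", "0-2").

```python
from typing import List, Optional


def parse_cpu_list(spec: str) -> List[int]:
    """Expand a -c value such as "0", "0,1,2" or "0-2" into core numbers."""
    if "-" in spec:
        # Range form n-m: both ends are included.
        first, last = (int(part) for part in spec.split("-"))
        if first > last:
            raise ValueError("range start must not exceed range end: " + spec)
        return list(range(first, last + 1))
    # Single core or comma-separated list.
    return [int(part) for part in spec.split(",")]


def check_timing(duration: Optional[int], delay: int = 0, interval: int = 1) -> None:
    """Raise ValueError if the -d/-D/-i values break the rules in Table 2."""
    if duration is None:
        # No duration: collection runs until it is interrupted.
        return
    if duration < 1:
        raise ValueError("-d must be at least 1 second")
    if delay >= duration:
        raise ValueError("-D must be less than -d")
    if interval > duration:
        raise ValueError("-i must be less than or equal to -d")
```

For example, `parse_cpu_list("0-2")` returns `[0, 1, 2]`, and `check_timing(duration=3, delay=3)` raises `ValueError` because the delay is not shorter than the collection duration.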

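Examples

  • Collect data for specified CPU cores.
    devkit tuner top-down -c 0-127 -d 3 -L 2

    -c 0-127 collects data on CPU cores 0 to 127, -d 3 sets the collection duration to 3 seconds, and -L 2 collects Back-End Bound->Core Bound data.

    The following information is returned: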

    ================================================================================
    Version     : DevKit xxx
    CPU Model   :xxx
    Command     : devkit tuner top-down -c 0-127 -d 3 -L 2
    ================================================================================
    
    TOP-DOWN Summary Report-ALL                    Time:2026/02/03 19:05:16
    =======================================================================
    
    Top-down metrics of CPU(s) 0-127:
    Cycles              408,642,602,711
    Instructions        347,968,271,194
    IPC                 0.85
    
    ────────────────────────────────────────────────────────────────────────────────
      Top-down Metrics                        Bound(%)    Preferred Sampling Event
    ────────────────────────────────────────────────────────────────────────────────
      Bad Speculation                             0.16    --
    
      Frontend Bound                              2.05    --
    
      Retiring                                   14.19    inst_retired
    
      Backend Bound                              83.59    --
      ├── Core Bound                             36.21    --
      │   ├── FDIV Stall                          0.00    --
      │   ├── DIV Stall                           0.00    --
      │   ├── FSU Stall                           1.12    --
      │   ├── Resource Bound*                    13.39    --
      │   │   ├── Rob_stall*                      0.48    --
      │   │   ├── Ptag_stall*                     7.18    --
      │   │   ├── MapQ_stall*                     5.72    --
      │   │   ├── PCBuf_stall*                    0.01    --
      │   │   └── Other_stall*                    0.00    --
      │   └── Exe Ports Util                     21.70    --
      │       ├── 0 ports serialize               0.47    --
      │       ├── 0 ports non serialize          13.99    --
      │       ├── 1 ports                         0.81    --
      │       ├── 2 ports                         0.33    --
      │       ├── 3 ports                         0.64    --
      │       ├── 4 ports                         5.40    --
      │       ├── 5 ports                         0.04    --
      │       └── 6p ports                        0.02    --
      └── Memory Bound                           47.38    --
    ────────────────────────────────────────────────────────────────────────────────
    
    ────────────────────────────────────────────────────────────────────────────────
      PMU Event                                  Count
    ────────────────────────────────────────────────────────────────────────────────
      r0008                               347,968,271,194
      r0011                               408,642,602,711
      r001b                               351,964,762,921
      r2004                                 4,757,141,798
      r2005                                   91,300,454
      r2006                                 9,758,187,924
      r2007                                      18,186
      r2008                                62,112,994,434
      r2009                                          0
      r200a                                        164
      r200b                                 4,558,286,815
      r200c                                35,479,858,717
      r200d                                17,243,622,831
      r2011                                50,301,733,180
      r7000                                98,402,039,920
      r7001                               387,268,814,027
      r7002                                    1,216,970
      r7003                                   27,917,649
      r7004                                 8,211,072,317
      r7005                               219,520,201,751
      r7006                                    1,234,618
      r700a                                 8,177,731,287
      r700b                               241,370,982,829
      r700c                                13,916,833,884
      r700d                                 5,613,201,615
      r700e                                11,099,720,734
      r700f                                93,253,410,476
      r7010                                  688,907,852
      r7011                                  345,655,294
    ────────────────────────────────────────────────────────────────────────────────
    3083 milliseconds time elapsed
    
    Metrics marked with '*' indicate approximate values.
    
    Note: To view the hotspot data. You can run devkit tuner hotspot -e [Preferred Sampling Event]
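  • Collect data for a specified process ID.
    devkit tuner top-down -p 3716829 -d 3

    -p 3716829 collects data for the process whose PID is 3716829, and -d 3 sets the collection duration to 3 seconds. Because "-L" is not specified, data for all dimensions is collected.

    The following information is returned: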

    ================================================================================
    Version     : DevKit xxx
    Command     : devkit tuner top-down -p 3716829 -d 3
    ================================================================================
    
    TOP-DOWN Summary Report-ALL                    Time:2026/02/03 19:11:04
    =======================================================================
    
    Top-down metrics of process id '3716829':
    Cycles              565,161,956,085
    Instructions        890,812,766,868
    IPC                 1.58
    
    ────────────────────────────────────────────────────────────────────────────────
      Top-down Metrics                        Bound(%)    Preferred Sampling Event
    ────────────────────────────────────────────────────────────────────────────────
      Bad Speculation                             0.00    --
      ├── Branch Mispredicts                      0.00    br_mis_pred
      │   ├── Indirect Branch                     0.00    --
      │   ├── Push Branch                         0.00    --
      │   ├── Pop Branch                          0.00    --
      │   └── Other Branch                        0.00    --
      └── Machine Clears                          0.00    --
          ├── Nuke Flush                          0.00    --
          └── Other Flush                         0.00    --
    
      Frontend Bound                              0.87    --
      ├── Fetch Latency Bound                     0.74    --
      │   ├── ITLB Miss                           0.06    --
      │   ├── ICache Miss                         0.62    --
      │   ├── BP_Misp_Flush                       0.03    br_mis_pred
      │   ├── OoO Flush                           0.01    --
      │   └── Static Predictor Flush              0.03    --
      └── Fetch Bandwidth Bound                   0.13    --
    
      Retiring                                   26.27    inst_retired
    
      Backend Bound                              72.86    --
      ├── Core Bound                             33.15    --
      │   ├── FDIV Stall                          0.00    --
      │   ├── DIV Stall                           0.00    --
      │   ├── FSU Stall                           0.84    --
      │   ├── Resource Bound*                    11.94    --
      │   │   ├── Rob_stall*                      0.15    --
      │   │   ├── Ptag_stall*                     6.40    --
      │   │   ├── MapQ_stall*                     5.39    --
      │   │   ├── PCBuf_stall*                    0.00    --
      │   │   └── Other_stall*                    0.00    --
      │   └── Exe Ports Util                     20.37    --
      │       ├── 0 ports serialize               0.28    --
      │       ├── 0 ports non serialize          12.10    --
      │       ├── 1 ports                         0.62    --
      │       ├── 2 ports                         0.29    --
      │       ├── 3 ports                         0.54    --
      │       ├── 4 ports                         6.49    --
      │       ├── 5 ports                         0.04    --
      │       └── 6p ports                        0.02    --
      └── Memory Bound                           39.70    --
          ├── L1 Bound                            3.51    --
          │   ├── DTLB                            0.18    --
          │   ├── Misalign                        0.53    --
          │   ├── Resource Full                   0.00    --
          │   ├── Instruction Type                0.14    --
          │   ├── Forward hazard                  0.15    --
          │   ├── Structure hazard                1.77    --
          │   └── Pipeline                        0.74    --
          ├── L2 Bound                            0.00    --
          │   ├── buffer pending                  0.00    --
          │   ├── snoop pending                   0.00    --
          │   ├── Arb idle                        0.00    --
          │   └── Pipeline                        0.00    --
          ├── L3 or DRAM Bound                   36.20    --
          └── Store Bound                         0.00    --
              ├── SCA                             0.00    --
              ├── Head                            0.00    --
              ├── Order                           0.00    --
              └── Other                           0.00    --
    ────────────────────────────────────────────────────────────────────────────────
    
    ────────────────────────────────────────────────────────────────────────────────
      PMU Event                                  Count
    ────────────────────────────────────────────────────────────────────────────────
      r0008                               890,812,766,869
      r0010                                   42,361,366
      r0011                               565,161,956,085
      r001b                               885,884,852,999
      r0027                                  120,670,847
      r0028                                   29,018,877
      r002e                                     647,664
      r0030                                   24,644,752
      r100d                                    5,983,845
      r1010                                    8,632,549
      r1013                                      16,476
      r1016                                     123,653
      r104f                                   34,432,517
      r2004                                 2,603,284,473
      r2005                                   69,323,025
      r2006                                 5,786,237,013
      r2007                                      29,311
      r2008                               109,031,479,289
      r2009                                         14
      r200a                                          0
      r200b                                 6,304,758,864
      r200c                                50,709,639,660
      r200d                                39,671,785,098
      r200f                                    4,573,402
      r2010                                    6,060,229
      r2011                                29,618,727,432
      r2012                                 4,204,799,697
      r5090                                  498,155,819
      r5091                                 1,505,244,430
      r5092                                    1,397,778
      r5093                                  399,164,697
      r5094                                  432,410,165
      r5095                                 5,041,495,853
      r5096                                 2,097,655,733
      r50a0                                  159,236,956
      r50a2                                17,038,846,820
      r50a3                               256,623,465,481
      r50a4                                  245,022,246
      r7000                               143,896,308,625
      r7001                               561,721,629,213
      r7002                                    1,740,417
      r7003                                   25,562,410
      r7004                                10,080,452,487
      r7005                               306,104,360,719
      r7006                                    5,399,535
      r7007                               278,926,699,807
      r7008                               280,522,927,806
      r700a                                 7,776,571,030
      r700b                               341,315,121,543
      r700c                                17,521,518,047
      r700d                                 8,248,288,096
      r700e                                15,281,449,842
      r700f                               183,091,541,693
      r7010                                 1,132,511,038
      r7011                                  558,813,174
      r701e                                    6,807,528
      r701f                                   95,045,849
      r7020                                  960,519,537
    ────────────────────────────────────────────────────────────────────────────────
    3378 milliseconds time elapsed
    
    Metrics marked with '*' indicate approximate values.
    
    Note: To view the hotspot data. You can run devkit tuner hotspot -e [Preferred Sampling Event]
    
    Optimization Suggestions
    
        1. The percentage of Backend Bound is high.(Threshold: 20.00%)
           Take the following optimization measures for C/C++ applications compiled using the BiSheng compiler. For other compilers, you can refer to the optimization
           suggestions. Verify the optimization suggestions in your specific application scenario.
           (1) Use the jemalloc library. Associate the libjemalloc.so soft link in the lib directory of the BiSheng compiler with the jemalloc dynamic library entity whose
           size is the same as the system page table size in this directory, and add the -ljemalloc parameter for the compilation.
           (2) Set Wrap-memset/memcpy: Wl,-wrap=memset/memcpy -lstringlib. The BiSheng compiler provides memset/memcpy implementation in the libstring library, which is more
           adaptable to the AArch64 architecture. When the glibc version is earlier and the function proportion is high, the performance is significantly improved.
           (3) Set prefetch to save the data to be accessed to the cache, so as to reduce the value of d-cache miss. The hardware has its own prefetch mechanism. The compiler
           supports the software prefetch function. When tsv110 is enabled, the BiSheng compiler automatically enables software prefetch. You can adjust the prefetch density
           by using the three parameters: -mllvm -prefetch-loop-depth=x -mllvm -min-prefetch-stride=y -mllvm -prefetch-distance=z, where for example, x=3, y=9, z=940.
           (4) Add the -fstack-arrays parameter to place all arrays onto the stack. The parameter takes effect only on Fortran.
           (5) Try enabling huge pages.

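    The Preferred Sampling Event column lists the key events that affect each microarchitecture bound. Tuning against these key events can reduce the corresponding bound. Run devkit tuner hotspot -e [Preferred Sampling Event] to analyze and tune.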

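  • Collect data for an application.
    devkit tuner top-down -d 10 -L 2 /opt/testdemo/cache_miss_long

    -d 10 sets the collection duration to 10 seconds, and -L 2 collects Back-End Bound->Core Bound data.

    The following information is returned: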

    ================================================================================
    Version     : DevKit xxx
    Command     : devkit tuner top-down -d 10 -L 2 /opt/testdemo/cache_miss_long
    ================================================================================
    
    TOP-DOWN Summary Report-ALL                    Time:2026/02/03 19:14:04
    =======================================================================
    
    Top-down metrics of /opt/testdemo/cache_miss_long:
    Cycles              28,931,970,351
    Instructions        12,298,232,508
    IPC                 0.43
    
    ────────────────────────────────────────────────────────────────────────────────
      Top-down Metrics                        Bound(%)    Preferred Sampling Event
    ────────────────────────────────────────────────────────────────────────────────
      Bad Speculation                             0.22    --
    
      Frontend Bound                              1.52    --
    
      Retiring                                    7.08    inst_retired
    
      Backend Bound                              91.18    --
      ├── Core Bound                             31.48    --
      │   ├── FDIV Stall                          0.00    --
      │   ├── DIV Stall                           0.00    --
      │   ├── FSU Stall                           0.00    --
      │   ├── Resource Bound*                    20.10    --
      │   │   ├── Rob_stall*                      0.04    --
      │   │   ├── Ptag_stall*                    18.33    --
      │   │   ├── MapQ_stall*                     1.73    --
      │   │   ├── PCBuf_stall*                    0.00    --
      │   │   └── Other_stall*                    0.00    --
      │   └── Exe Ports Util                     11.38    --
      │       ├── 0 ports serialize               0.16    --
      │       ├── 0 ports non serialize           8.04    --
      │       ├── 1 ports                         1.54    --
      │       ├── 2 ports                         0.87    --
      │       ├── 3 ports                         0.45    --
      │       ├── 4 ports                         0.21    --
      │       ├── 5 ports                         0.07    --
      │       └── 6p ports                        0.03    --
      └── Memory Bound                           59.70    --
    ────────────────────────────────────────────────────────────────────────────────
    
    ────────────────────────────────────────────────────────────────────────────────
      PMU Event                                  Count
    ────────────────────────────────────────────────────────────────────────────────
      r0008                                12,298,232,508
      r0011                                28,931,970,351
      r001b                                12,675,652,675
      r2004                                   51,385,954
      r2005                                    5,877,420
      r2006                                23,237,998,292
      r2007                                       2,054
      r2008                                          0
      r2009                                          0
      r200a                                          0
      r200b                                 2,188,478,662
      r200c                                          0
      r200d                                          0
      r2011                                 2,638,598,816
      r7000                                17,589,179,668
      r7001                                28,842,703,274
      r7002                                          0
      r7003                                     151,662
      r7004                                          0
      r7005                                18,885,819,361
      r7006                                          0
      r700a                                  417,655,851
      r700b                                20,454,124,794
      r700c                                 3,923,354,625
      r700d                                 2,222,293,173
      r700e                                 1,147,598,469
      r700f                                  523,512,233
      r7010                                  182,140,658
      r7011                                   87,297,422
    ────────────────────────────────────────────────────────────────────────────────
    10003 milliseconds time elapsed
    
    Metrics marked with '*' indicate approximate values.
    
    Note: To view the hotspot data. You can run devkit tuner hotspot -e [Preferred Sampling Event]
    
    Optimization Suggestions
    
        1. The percentage of Backend Bound is high.(Threshold: 20.00%)
           Take the following optimization measures for C/C++ applications compiled using the BiSheng compiler. For other compilers, you can refer to the optimization
           suggestions. Verify the optimization suggestions in your specific application scenario.
           (1) Use the jemalloc library. Associate the libjemalloc.so soft link in the lib directory of the BiSheng compiler with the jemalloc dynamic library entity whose
           size is the same as the system page table size in this directory, and add the -ljemalloc parameter for the compilation.
           (2) Set Wrap-memset/memcpy: Wl,-wrap=memset/memcpy -lstringlib. The BiSheng compiler provides memset/memcpy implementation in the libstring library, which is more
           adaptable to the AArch64 architecture. When the glibc version is earlier and the function proportion is high, the performance is significantly improved.
           (3) Set prefetch to save the data to be accessed to the cache, so as to reduce the value of d-cache miss. The hardware has its own prefetch mechanism. The compiler
           supports the software prefetch function. When tsv110 is enabled, the BiSheng compiler automatically enables software prefetch. You can adjust the prefetch density
           by using the three parameters: -mllvm -prefetch-loop-depth=x -mllvm -min-prefetch-stride=y -mllvm -prefetch-distance=z, where for example, x=3, y=9, z=940.
           (4) Add the -fstack-arrays parameter to place all arrays onto the stack. The parameter takes effect only on Fortran.
           (5) Try enabling huge pages.
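  • Collect data for a cgroup.
    devkit tuner top-down -d 10 -L 2 -G my_test_cgroup

    -d 10 sets the collection duration to 10 seconds, -L 2 collects Back-End Bound->Core Bound data, and -G my_test_cgroup collects data for the cgroup named my_test_cgroup.

    The following information is returned: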

    ================================================================================
    Version     : DevKit xxx
    Command     : devkit tuner top-down -d 10 -L 2 -G my_test_cgroup
    ================================================================================
    
    TOP-DOWN Summary Report-ALL                    Time:2026/02/03 19:20:10
    =======================================================================
    
    Top-down metrics of cgroup: my_test_cgroup:
    Cycles              28,852,354,047
    Instructions        29,575,950,273
    IPC                 1.03
    
    ────────────────────────────────────────────────────────────────────────────────
      Top-down Metrics                        Bound(%)    Preferred Sampling Event
    ────────────────────────────────────────────────────────────────────────────────
      Bad Speculation                            47.60    --
    
      Frontend Bound                              9.24    --
    
      Retiring                                   17.08    inst_retired
    
      Backend Bound                              26.07    --
      ├── Core Bound                             18.37    --
      │   ├── FDIV Stall                          0.00    --
      │   ├── DIV Stall                           0.00    --
      │   ├── FSU Stall                           0.00    --
      │   ├── Resource Bound*                     0.97    --
      │   │   ├── Rob_stall*                      0.00    --
      │   │   ├── Ptag_stall*                     0.90    --
      │   │   ├── MapQ_stall*                     0.08    --
      │   │   ├── PCBuf_stall*                    0.00    --
      │   │   └── Other_stall*                    0.00    --
      │   └── Exe Ports Util                     17.40    --
      │       ├── 0 ports serialize               0.02    --
      │       ├── 0 ports non serialize           4.40    --
      │       ├── 1 ports                         0.15    --
      │       ├── 2 ports                         0.81    --
      │       ├── 3 ports                         2.24    --
      │       ├── 4 ports                         3.56    --
      │       ├── 5 ports                         3.47    --
      │       └── 6p ports                        2.74    --
      └── Memory Bound                            7.70    --
    ────────────────────────────────────────────────────────────────────────────────
    
    ────────────────────────────────────────────────────────────────────────────────
      PMU Event                                  Count
    ────────────────────────────────────────────────────────────────────────────────
      r0008                                29,575,950,273
      r0011                                28,852,354,047
      r001b                               111,983,642,709
      r2004                                    4,947,993
      r2005                                     297,829
      r2006                                 3,396,949,558
      r2007                                        291
      r2008                                          0
      r2009                                          0
      r200a                                          0
      r200b                                  284,335,306
      r200c                                          0
      r200d                                       7,780
      r2011                                16,002,028,769
      r7000                                  958,854,053
      r7001                                24,280,047,948
      r7002                                        158
      r7003                                     153,511
      r7004                                    1,838,145
      r7005                                 7,167,944,269
      r7006                                          0
      r700a                                   31,687,641
      r700b                                 7,243,637,081
      r700c                                  246,613,685
      r700d                                 1,332,389,662
      r700e                                 3,690,316,653
      r700f                                 5,861,896,332
      r7010                                 5,718,837,368
      r7011                                 4,515,603,055
    ────────────────────────────────────────────────────────────────────────────────
    10118 milliseconds time elapsed
    
    Metrics marked with '*' indicate approximate values.
    
    Note: To view the hotspot data. You can run devkit tuner hotspot -e [Preferred Sampling Event]
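
The reports above share one layout, so their headline figures can be read programmatically. The following Python sketch is illustrative only and not part of DevKit: the function name `parse_report` and the regular expressions are assumptions based on the sample output in this section, and `SAMPLE` is an excerpt of the first report (the CPU collection).

```python
import re
from typing import Dict, Tuple

# A metric row looks like "  ├── Core Bound      36.21    --": optional tree
# characters, a name, a percentage, then the preferred sampling event.
ROW = re.compile(r"^[\s│├└─]*(?P<name>.+?)\s{2,}(?P<pct>\d+\.\d+)\s+\S+\s*$")
COUNTER = re.compile(r"^\s*(Cycles|Instructions)\s+([\d,]+)\s*$")

SAMPLE = """\
Cycles              408,642,602,711
Instructions        347,968,271,194
IPC                 0.85
  Bad Speculation                             0.16    --
  Frontend Bound                              2.05    --
  Retiring                                   14.19    inst_retired
  Backend Bound                              83.59    --
  ├── Core Bound                             36.21    --
  └── Memory Bound                           47.38    --
"""


def parse_report(text: str) -> Tuple[Dict[str, int], Dict[str, float]]:
    """Return (counters, top-level metrics) parsed from a report."""
    counters: Dict[str, int] = {}
    top_level: Dict[str, float] = {}
    for line in text.splitlines():
        match = COUNTER.match(line)
        if match:
            counters[match.group(1)] = int(match.group(2).replace(",", ""))
            continue
        match = ROW.match(line)
        # Rows that start with a tree character are children; skip them.
        if match and line.lstrip()[0] not in "│├└":
            top_level[match.group("name")] = float(match.group("pct"))
    return counters, top_level


counters, top_level = parse_report(SAMPLE)
# IPC is instructions divided by cycles.
ipc = counters["Instructions"] / counters["Cycles"]
# The largest top-level category is the first place to look.
dominant = max(top_level, key=top_level.get)
```

With the excerpt above, `round(ipc, 2)` is 0.85, matching the IPC line of the report, and `dominant` is "Backend Bound". The four top-level percentages add up to 99.99, which suggests they partition the pipeline slots up to rounding.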