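The top-down command uses ARM PMU (Performance Monitoring Unit) events to show how instructions run on the CPU pipeline, so that you can make targeted changes to your program and make full use of the available hardware resources.

Command Function

Uses ARM PMU events to show how instructions run on the CPU pipeline and to quickly locate the performance bottlenecks of the current application on the CPU.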

Command Syntax

devkit tuner top-down [-h] [-c {n | n,m | n-m}] [-d <sec>] [-D <sec>] [-l {0, 1, 2, 3}] [-L {0, 1, 2, 3, 4, 5, 6}] [-i <sec>] [-p {PID | PID1,PID2 | ALL}] [-r {user, kernel, all}] [-o] [--package] [workload workload...]

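To collect data for a specific application, run devkit tuner top-down [workload workload...], replacing [workload workload...] with the application path and its arguments. If both -c/--cpu and -p/--pid are specified, the target specified by -p takes priority.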
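The argument order described above can be sketched in Python. The helper name build_topdown_cmd is hypothetical and not part of DevKit; it only assembles an argument list from flags that appear in this document. Dropping -c when -p is given is a choice made in this sketch, based on the stated priority of -p.

```python
from typing import List, Optional, Sequence


def build_topdown_cmd(
    workload: Optional[Sequence[str]] = None,
    pid: Optional[str] = None,
    cpu: Optional[str] = None,
    duration: Optional[int] = None,
) -> List[str]:
    """Assemble an argument list for `devkit tuner top-down` (hypothetical helper)."""
    cmd = ["devkit", "tuner", "top-down"]
    if pid is not None:
        # The documentation says -p takes priority over -c,
        # so this sketch passes only -p when both are given.
        cmd += ["-p", pid]
    elif cpu is not None:
        cmd += ["-c", cpu]
    if duration is not None:
        cmd += ["-d", str(duration)]
    if workload:
        # The application path and its own arguments go last.
        cmd += list(workload)
    return cmd


print(build_topdown_cmd(workload=["/opt/testdemo/topdown_suggest"], duration=10))
```

The resulting list could be passed to subprocess.run on a machine where DevKit is installed.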

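Parameter Description

Table 1 Parameter description
Parameter	Option	Description
-h/--help	-	Displays help information.
-c/--cpu	-	Specifies the CPU cores to collect from, for example "0", "0,1,2", or "0-2".
-d/--duration	-	Sets the collection duration in seconds. The minimum value is 1. By default, collection continues until it is stopped: press Ctrl+\ to cancel the task, or press Ctrl+C to stop collection and start the analysis.
-D/--delay	-	Sets the delay before collection starts, in seconds. The default is 0. The value must be less than the collection duration.
-i/--interval	-	Sets the collection interval in seconds. The default is 1. If a collection duration is set, the interval must be less than or equal to it.
-l/--log-level	0/1/2/3	Sets the log level. The default is 1. 0: DEBUG. 1: INFO. 2: WARNING. 3: ERROR.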
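-L/--profile-level	0/1/2/3/4/5/6	Sets the metrics to analyze. The default is 0. 0: All. Collects data for all dimensions and outputs the corresponding results. 1: Level 1. Collects Back-End Bound, Bad Speculation, Front-End Bound, and Retiring. 2: Back-End Bound->Core Bound. The back end is the part of the processor that dispatches and executes micro-operations out of order and returns the final results. Core Bound, a subcategory of Back-End Bound, reflects the proportion of the bottleneck caused by insufficient execution unit resources. 3: Back-End Bound->Memory Bound. Memory Bound, a subcategory of Back-End Bound, reflects pipeline stalls caused by waiting for data reads and writes. 4: Back-End Bound->Resource Bound (supported on Kunpeng 920 series processors). Resource Bound, a subcategory of Back-End Bound, reflects pipeline stalls that occur when there are not enough resources to dispatch micro-operations to the out-of-order scheduler. 5: Bad Speculation. Reflects the pipeline resources wasted because of incorrect instruction prediction. 6: Front-End Bound. The front end is the part of the processor in which the instruction fetch unit fetches instructions and converts them into micro-operations for the back-end pipeline. This metric reflects the proportion of the front end that is not fully used.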
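-o/--output	-	Sets the name and output path of the report data package. If only a name is given, the package is generated in the current directory. Must be used together with --package.
-r/--collection-range	user/kernel/all	Sets the collection range. When -p/--pid is set to ALL, data can be collected for kernel-mode or user-mode processes. The default is all. user: collects user-mode performance data. kernel: collects kernel-mode performance data. all: collects both user-mode and kernel-mode performance data.
-p/--pid	PID/PID1,PID2/ALL	Specifies the PIDs of the processes to collect. Separate multiple PIDs with commas (,). By default, all processes are collected (ALL). If both -p and -c are specified, the processes specified by -p take priority.
--package	-	Generates a report data package. If no package name or path is specified, top-down-<timestamp>.tar is generated in the current directory.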
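The timing constraints in Table 1 (duration at least 1 second, delay less than duration, interval no greater than duration) can be checked before a command is launched. The following is a minimal sketch; check_timing is a hypothetical helper, and it assumes the constraints apply only when a duration is set.

```python
from typing import List, Optional


def check_timing(
    duration: Optional[float] = None,
    delay: float = 0,
    interval: float = 1,
) -> List[str]:
    """Return a list of violated constraints (empty if the values are consistent)."""
    problems: List[str] = []
    if duration is None:
        # No duration means "collect until stopped"; this sketch assumes
        # the delay and interval limits do not apply in that case.
        return problems
    if duration < 1:
        problems.append("duration must be at least 1 second")
    if delay >= duration:
        problems.append("delay must be less than duration")
    if interval > duration:
        problems.append("interval must not exceed duration")
    return problems


print(check_timing(duration=3, delay=3))
```

An empty list means the three values are mutually consistent under the rules in the table.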

Examples

Collecting by CPU

devkit tuner top-down -c 0-127 -d 3 -o /home/topdown_cpu -L 2 --package

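In this command, -c 0-127 collects data from CPU cores 0 to 127, -d 3 sets the collection duration to 3 seconds, -o /home/topdown_cpu together with --package generates a report data package named topdown_cpu in the specified path, and -L 2 collects Back-End Bound->Core Bound data.

The output is as follows: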

TOP-DOWN Summary Report-ALL                    Time:2024/08/06 15:41:27
=======================================================================

Top-down metrics of the system:
Cycles                     244,796,223
Instructions               138,949,659
IPC                               0.57

──────────────────────────────────────────────────────────────────
  Top-down Metrics                         Bound(%)    Preferred Sampling Event
──────────────────────────────────────────────────────────────────
  Bad Speculation                            15.31    --

  Frontend Bound                             40.35    fetch_bubble

  Retiring                                   14.19    inst_retired

  Backend Bound                              30.15    --
  ├── Resource Bound                       4.43    --
  ├── Core Bound                          15.42    --
  │   ├── Divider Stall                   0.00    --
  │   ├── FSU Stall                       0.00    --
  │   └── Exe Ports Util                 15.41    --
  │       ├── ALU BRU IssueQ Full         0.61    --
  │       ├── LS IssueQ Full              1.14    --
  │       └── FSU IssueQ Full             0.00    --
  └── Memory Bound                        10.26    --
──────────────────────────────────────────────────────────────────
3009 milliseconds time elapsed

Note: To view the hotspot data. You can run devkit tuner hotspot -e [Preferred Sampling Event]

The report /home/topdown_cpu.tar is generated successfully.
To view summary report. you can run: devkit report -i /home/topdown_cpu.tar
To view detail report. you can import the report to the WebUI or IDE to view details.
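The IPC value in the report header is consistent with Instructions divided by Cycles. The short sketch below recomputes it from the two counters shown above; parse_count and ipc are illustrative names, not DevKit functions.

```python
def parse_count(text: str) -> int:
    """Convert a comma-grouped counter such as '244,796,223' to an int."""
    return int(text.replace(",", ""))


def ipc(instructions: int, cycles: int) -> float:
    """Instructions per cycle."""
    return instructions / cycles


cycles = parse_count("244,796,223")
instructions = parse_count("138,949,659")
print(f"IPC = {ipc(instructions, cycles):.2f}")  # prints "IPC = 0.57"
```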

Collecting by process ID

devkit tuner top-down -p 12540 -d 3 -o /home/topdown_pid --package

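In this command, -p 12540 collects data from the process whose PID is 12540, -d 3 sets the collection duration to 3 seconds, and -o /home/topdown_pid together with --package generates a report data package named topdown_pid in the specified path. Because -L is not specified, data for all dimensions is collected.

The output is as follows: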

TOP-DOWN Summary Report-ALL                    Time:2024/08/06 15:48:48
=======================================================================

Top-down metrics of process id '1884856':
Cycles                   1,488,556,148
Instructions             1,480,811,195
IPC                               0.99

──────────────────────────────────────────────────────────────────
  Top-down Metrics                         Bound(%)    Preferred Sampling Event
──────────────────────────────────────────────────────────────────
  Bad Speculation                            55.99    --
  ├── Branch Mispredicts                  55.86    br_mis_pred
  │   ├── Indirect Branch                 0.00    --
  │   ├── Push Branch                     0.00    --
  │   ├── Pop Branch                      0.00    --
  │   └── Other Branch                   55.86    --
  └── Machine Clears                       0.12    --
      ├── Nuke Flush                       0.02    --
      └── Other Flush                      0.09    --

  Frontend Bound                             12.48    fetch_bubble
  ├── Fetch Latency Bound                  9.42    --
  │   ├── ITLB Miss                       0.02    --
  │   │   ├── L1 Tlb                     0.02    --
  │   │   └── L2 Tlb                     0.00    l2i_tlb_refill
  │   ├── ICache Miss                     0.43    --
  │   │   ├── L1 Cache                   0.11    --
  │   │   └── L2 Cache                   0.32    l2i_cache_refill
  │   ├── Branch Mispredict Flush         8.91    br_mis_pred
  │   ├── OoO Flush                       0.01    --
  │   └── Static Predictor Flush          0.05    --
  └── Fetch Bandwidth Bound                3.05    --

  Retiring                                   24.86    inst_retired

  Backend Bound                               6.66    --
  ├── Resource Bound                       0.07    --
  │   ├── Sync Stall                      0.00    --
  │   ├── Reorder Buffer Stall            0.00    --
  │   ├── Physical Tag Stall              0.07    --
  │   ├── SaveOp Queue Stall              0.00    --
  │   ├── PC Buffer Stall                 0.00    --
  │   └── Other Stall                     0.00    --
  ├── Core Bound                           4.80    --
  │   ├── Divider Stall                   0.00    --
  │   ├── FSU Stall                       0.00    --
  │   └── Exe Ports Util                  4.79    --
  │       ├── ALU BRU IssueQ Full         0.03    --
  │       ├── LS IssueQ Full              0.17    --
  │       └── FSU IssueQ Full             0.00    --
  └── Memory Bound                         1.77    --
      ├── L1 Bound                         1.73    --
      ├── L2 Bound                         0.03    --
      ├── L3 or DRAM Bound                 0.01    cache-misses
      └── Store Bound                      0.00    --
──────────────────────────────────────────────────────────────────
3000 milliseconds time elapsed

Note: To view the hotspot data. You can run devkit tuner hotspot -e [Preferred Sampling Event]

The report /home/topdown_pid.tar is generated successfully.
To view summary report. you can run: devkit report -i /home/topdown_pid.tar
To view detail report. you can import the report to the WebUI or IDE to view details.
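As a quick sanity check when reading such a report, the four level-1 percentages can be summed and the largest one picked out. This sketch assumes, as in the usual top-down method, that the four level-1 categories together account for roughly 100% of the pipeline; the values are copied from the report above.

```python
# Level-1 percentages copied from the sample report above.
level1 = {
    "Bad Speculation": 55.99,
    "Frontend Bound": 12.48,
    "Retiring": 24.86,
    "Backend Bound": 6.66,
}

total = sum(level1.values())
dominant = max(level1, key=level1.get)

print(f"total = {total:.2f}%")      # prints "total = 99.99%"
print(f"largest share: {dominant}")  # prints "largest share: Bad Speculation"
```

Here Bad Speculation has the largest share, and its Branch Mispredicts child accounts for nearly all of it, which suggests looking at branch behavior first.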

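The Preferred Sampling Event column lists the key events that affect each microarchitecture bound. Tuning for a key event can reduce the corresponding bound. Run devkit tuner hotspot -e [Preferred Sampling Event] to analyze and tune.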
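The following sketch shows one way to pull the Preferred Sampling Event values out of the text report and turn them into hotspot commands. The regular expression is an assumption based only on the layout of the sample output in this document (metric name, percentage, event), and parse_rows and hotspot_commands are hypothetical helper names.

```python
import re
from typing import List, Tuple

# Optional tree-drawing prefix, metric name, at least two spaces,
# a percentage, then the event column ("--" means no event).
ROW = re.compile(
    r"^[\s│├└─]*(?P<name>[A-Za-z0-9][A-Za-z0-9 ]*?)\s{2,}"
    r"(?P<pct>\d+\.\d+)\s+(?P<event>\S+)\s*$"
)

SAMPLE = """\
  Bad Speculation                            55.99    --
  ├── Branch Mispredicts                  55.86    br_mis_pred
  │   └── Other Branch                   55.86    --
  Frontend Bound                             12.48    fetch_bubble
  │   ├── Branch Mispredict Flush         8.91    br_mis_pred
  Retiring                                   24.86    inst_retired
"""


def parse_rows(report: str) -> List[Tuple[str, float, str]]:
    """Return (metric, percentage, event) for every metric row found."""
    rows = []
    for line in report.splitlines():
        m = ROW.match(line)
        if m:
            rows.append((m.group("name"), float(m.group("pct")), m.group("event")))
    return rows


def hotspot_commands(report: str) -> List[str]:
    """Build one hotspot command per distinct event, in order of appearance."""
    seen: List[str] = []
    for _, _, event in parse_rows(report):
        if event != "--" and event not in seen:
            seen.append(event)
    return [f"devkit tuner hotspot -e {event}" for event in seen]


for cmd in hotspot_commands(SAMPLE):
    print(cmd)
```

Rows whose event column is "--" are skipped because the report gives no sampling event for them.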

Collecting by application

devkit tuner top-down -d 10 -o /home/topdown_app -L 2 --package /opt/testdemo/topdown_suggest

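This command collects data from the application /opt/testdemo/topdown_suggest for 10 seconds. -o /home/topdown_app together with --package generates a report data package named topdown_app in the specified path, and -L 2 collects Back-End Bound->Core Bound data.

The output is as follows: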

TOP-DOWN Summary Report-ALL                    Time:2024/12/25 17:45:45
=======================================================================
Top-down metrics of /opt/testdemo/topdown_suggest:
Cycles              25,998,441,500
Instructions        27,700,695,734
IPC                 1.07
──────────────────────────────────────────────────────────────────
  Top-down Metrics                         Bound(%)    Preferred Sampling Event
──────────────────────────────────────────────────────────────────
  Bad Speculation                             0.01    --
  Frontend Bound                             25.94    fetch_bubble
  Retiring                                   26.63    inst_retired
  Backend Bound                              47.42    --
  ├── Resource Bound                          7.74    --
  ├── Core Bound                             31.47    --
  │   ├── Divider Stall                       0.00    --
  │   ├── FSU Stall                           0.00    --
  │   └── Exe Ports Util                     31.46    --
  │       ├── ALU BRU IssueQ Full            15.67    --
  │       ├── LS IssueQ Full                  3.11    --
  │       └── FSU IssueQ Full                 0.00    --
  └── Memory Bound                            8.20    --
──────────────────────────────────────────────────────────────────
10000 milliseconds time elapsed
Note: To view the hotspot data. You can run devkit tuner hotspot -e [Preferred Sampling Event]
Optimization Suggestions
    1. The percentage of Frontend Bound is high.(Threshold: 20.00%)
       Take the following optimization measures for C/C++ applications compiled using the BiSheng compiler. For other compilers, you can refer to the optimization
       suggestions. Verify the optimization suggestions in your specific application scenario.
       (1) Set the Inline parameter: -mllvm -inline-threshold=1550 (1550 is an optimal empirical value). You are advised to enable LTO in advance.
       (2) Adjust the alignment of functions, that of basic blocks, and that of basic blocks without jumping: -mllvm -align-all-functions=2^n -mllvm
       -align-all-blocks=2^n -mllvm -align-all-nofallthru-blocks=2^n, where the value of 2^n is 32 or 64 and can be changed to 16 or 128 if needed.
       (3) Enable PGO: -mllvm -enable-split-machine-functions. After PGO is enabled, the compiler splits functions based on the popularity of basic blocks and adjusts
       the code block layout to optimize the program performance.
       (4) Enable LTO. There are two types of LTO: full and thin, which correspond to -flto=full and -flto=thin. Full LTO delivers superior performance but requires a
       longer compilation time. Thin LTO delivers inferior performance but needs a shorter compilation time. To enable LTO for the compilation, add a link time option
       to the optimization options of 1) to 3). For example, for -mllvm -enable-split-machine-functions, prefix it with -fuse-ld=lld -Wl, that is, -fuse-ld=lld
       -Wl,-mllvm,-enable-split-machine-functions.
    2. The percentage of Backend Bound is high.(Threshold: 20.00%)
       Take the following optimization measures for C/C++ applications compiled using the BiSheng compiler. For other compilers, you can refer to the optimization
       suggestions. Verify the optimization suggestions in your specific application scenario.
       (1) Use the jemalloc library. Associate the libjemalloc.so soft link in the lib directory of the BiSheng compiler with the jemalloc dynamic library entity whose
       size is the same as the system page table size in this directory, and add the -ljemalloc parameter for the compilation.
       (2) Set Wrap-memset/memcpy: Wl,-wrap=memset/memcpy -lstringlib. The BiSheng compiler provides memset/memcpy implementation in the libstring library, which is
       more adaptable to the AArch64 architecture. When the glibc version is earlier and the function proportion is high, the performance is significantly improved.
       (3) Set prefetch to save the data to be accessed to the cache, so as to reduce the value of d-cache miss. The hardware has its own prefetch mechanism. The
       compiler supports the software prefetch function. When tsv110 is enabled, the BiSheng compiler automatically enables software prefetch. You can adjust the
       prefetch density by using the three parameters: -mllvm -prefetch-loop-depth=x -mllvm -min-prefetch-stride=y -mllvm -prefetch-distance=z, where for example, x=3,
       y=9, z=940.
       (4) Add the -fstack-arrays parameter to place all arrays onto the stack. The parameter takes effect only on Fortran.
       (5) Try enabling huge pages.
The report /home/topdown_app.tar is generated successfully.
To view summary report. you can run: devkit report -i /home/topdown_app.tar
To view detail report. you can import the report to the WebUI or IDE to view details.

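The output above is the summary of the microarchitecture analysis task. Use the --package parameter to generate a TAR package, then import it into the web UI to view the results graphically. For details about importing, see the task import section under Task Management.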
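The Optimization Suggestions part of the application report flags Frontend Bound and Backend Bound against a printed threshold of 20.00%. The sketch below reproduces that comparison on the level-1 values from the report. It assumes a strict greater-than comparison and that the same threshold would apply to Bad Speculation; neither is confirmed by the output. Retiring is left out because it is not a bound.

```python
THRESHOLD = 20.00  # percent, as printed in the sample suggestions

# Level-1 bound percentages copied from the application report above.
bounds = {
    "Bad Speculation": 0.01,   # same threshold assumed for illustration
    "Frontend Bound": 25.94,
    "Backend Bound": 47.42,
}

# Strict ">" is an assumption; the tool's exact comparison is not documented here.
flagged = [name for name, pct in bounds.items() if pct > THRESHOLD]

for name in flagged:
    print(f"The percentage of {name} is high. (Threshold: {THRESHOLD:.2f}%)")
```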