Microarchitecture Analysis

Command Function

Uses Arm Performance Monitoring Unit (PMU) events to show how instructions execute on the CPU pipeline, helping you quickly locate the CPU performance bottlenecks of the current application. You can then modify the application to make full use of hardware resources.

Syntax

devkit tuner top-down [-h] [-c {n | n,m | n-m}] [-d <sec>] [-D <sec>] [-l {0, 1, 2, 3}] [-L {0, 1, 2, 3, 4, 5, 6}] [-i <sec>] [-p {PID | PID1,PID2 | ALL}] [-r {user, kernel, all}] [-o] [--package] [workload workload...]

devkit tuner top-down [workload workload...] collects data of a specified application. Replace [workload workload...] in the command with the application path and its parameters. If both -c/--cpu and -p/--pid are specified, the data specified by -p is collected preferentially.
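
The three -c/--cpu forms in the syntax above (n, n,m, and n-m) can be expanded into an explicit core list before a command is assembled. The following is a minimal Python sketch, not part of the DevKit: the function name expand_cpu_spec is hypothetical, and accepting mixed values such as 0-2,5 is an assumption that the syntax line does not confirm.

```python
from __future__ import annotations


def expand_cpu_spec(spec: str) -> list[int]:
    """Expand a -c/--cpu value such as "0", "0,1,2" or "0-2" into core IDs."""
    cores: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            # Range form n-m: both ends are included.
            first, last = (int(x) for x in part.split("-", 1))
            if first > last:
                raise ValueError(f"descending range: {part!r}")
            cores.update(range(first, last + 1))
        else:
            # Single core form n.
            cores.add(int(part))
    return sorted(cores)


print(expand_cpu_spec("0"))      # [0]
print(expand_cpu_spec("0,1,2"))  # [0, 1, 2]
print(expand_cpu_spec("0-2"))    # [0, 1, 2]
```

A script that prepares collection commands could use such a helper to reject a malformed core list before the tool is invoked.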

Parameter Description

**Table 1** Parameter description
Parameter	Option	Description
-h/--help	-	Obtains help information.
-c/--cpu	-	IDs of the CPU cores to collect data from. The value can be a single core (for example, 0), a comma-separated list (for example, 0,1,2), or a range (for example, 0-2).
-d/--duration	-	Collection duration, in seconds. The minimum value is 1. By default, collection continues until you stop it. Press Ctrl+\ to cancel the task, or press Ctrl+C to stop the collection and start the analysis.
-D/--delay	-	Collection delay, which defaults to 0 seconds and must be less than the collection duration.
-i/--interval	-	Collection interval, which defaults to 1 second. If the collection duration is set, the collection interval must be less than or equal to the collection duration.
-l/--log-level	0/1/2/3	Log level, which defaults to 1. 0: DEBUG 1: INFO 2: WARNING 3: ERROR
-L/--profile-level	0/1/2/3/4/5/6	Analysis metric, which defaults to 0. In levels 2 to 4, Back-End is the processor portion that performs out-of-order dispatch and execution of micro-ops (uOps) and returns results. 0: Data of all dimensions is collected and a result is generated. 1: Back-End Bound, Bad Speculation, Front-End Bound, and Retiring are collected. 2: Back-End Bound->Core Bound is collected. Core Bound is a subclass of Back-End Bound that reflects the proportion of performance bottlenecks caused by insufficient CPU execution unit resources. 3: Back-End Bound->Memory Bound is collected. Memory Bound is a subclass of Back-End Bound that reflects pipeline stalls caused by waiting for data reads and writes. 4: Back-End Bound->Resource Bound is collected (applicable to Kunpeng 920 series processors). Resource Bound is a subclass of Back-End Bound that reflects pipeline stalls that occur, due to insufficient resources, when uOps are dispatched to the out-of-order execution scheduler. 5: Bad Speculation is collected. It reflects pipeline resources wasted on incorrect instruction speculation. 6: Front-End Bound is collected. The front end is the part of the processor where instructions are fetched and decoded into uOps for execution by the back-end pipeline. This metric reflects the proportion of processor front-end resources that are underutilized.
-o/--output	-	Report package name and output path. If you enter a name only, the report package is generated in the current directory by default. This option must be used together with --package.
-r/--collection-range	user/kernel/all	Collection range, which defaults to all. user: collects user-mode performance data. kernel: collects kernel-mode performance data. all: collects user-mode and kernel-mode performance data. When -p/--pid is set to ALL, you can select user or kernel to collect only user-mode or only kernel-mode processes.
-p/--pid	PID/PID1,PID2/ALL	IDs of the processes to collect data from. Separate multiple PIDs with commas (,). The default value is ALL. If both the -p and -c parameters are used, the processes with the specified PIDs are collected preferentially.
--package	-	Indicates whether to generate a report data package. If you do not set the package name or path, the top-down-timestamp.tar package is generated in the current directory by default.
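
The constraints in Table 1 (minimum duration, delay shorter than the duration, interval no longer than the duration, -o only together with --package) can be checked before a command is run. The following Python sketch is illustrative only: build_top_down_argv and its keyword names are hypothetical, and the option order simply mirrors the examples in this section.

```python
import shlex


def build_top_down_argv(cpu=None, pid=None, duration=None, delay=0, interval=1,
                        output=None, profile_level=0, package=False, workload=()):
    """Check the constraints from Table 1, then return the argument list."""
    if duration is not None:
        if duration < 1:
            raise ValueError("-d: the minimum duration is 1 second")
        if delay >= duration:
            raise ValueError("-D: the delay must be less than the duration")
        if interval > duration:
            raise ValueError("-i: the interval must not exceed the duration")
    if profile_level not in range(7):
        raise ValueError("-L: the profile level must be 0 to 6")
    if output is not None and not package:
        raise ValueError("-o must be used together with --package")

    argv = ["devkit", "tuner", "top-down"]
    # Only options that differ from the documented defaults are emitted.
    if cpu is not None:
        argv += ["-c", str(cpu)]
    if pid is not None:
        argv += ["-p", str(pid)]
    if duration is not None:
        argv += ["-d", str(duration)]
    if delay:
        argv += ["-D", str(delay)]
    if interval != 1:
        argv += ["-i", str(interval)]
    if output is not None:
        argv += ["-o", output]
    if profile_level:
        argv += ["-L", str(profile_level)]
    if package:
        argv.append("--package")
    # The workload and its parameters come last, as in the syntax line.
    argv += list(workload)
    return argv


print(shlex.join(build_top_down_argv(cpu="0-127", duration=3,
                                     output="/home/topdown_cpu",
                                     profile_level=2, package=True)))
```

The returned list can be passed to subprocess.run on a host where the DevKit is installed. The sketch only assembles and validates the arguments; it does not run anything itself.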

Example

Collection based on CPUs:

devkit tuner top-down -c 0-127 -d 3 -o /home/topdown_cpu -L 2 --package

In this command, -c 0-127 collects data from CPU cores 0 to 127, and -d 3 sets the collection duration to 3 seconds. The -o /home/topdown_cpu and --package parameters generate a report package at /home/topdown_cpu.tar. The -L 2 parameter collects the Back-End Bound->Core Bound data.

Command output:

TOP-DOWN Summary Report-ALL                    Time:2024/08/06 15:41:27
=======================================================================

Top-down metrics of the system:
Cycles                     244,796,223
Instructions               138,949,659
IPC                               0.57

──────────────────────────────────────────────────────────────────
  Top-down Metrics                         Bound(%)    Preferred Sampling Event
──────────────────────────────────────────────────────────────────
  Bad Speculation                            15.31    --

  Frontend Bound                             40.35    fetch_bubble

  Retiring                                   14.19    inst_retired

  Backend Bound                              30.15    --
  ├── Resource Bound                       4.43    --
  ├── Core Bound                          15.42    --
  │   ├── Divider Stall                   0.00    --
  │   ├── FSU Stall                       0.00    --
  │   └── Exe Ports Util                 15.41    --
  │       ├── ALU BRU IssueQ Full         0.61    --
  │       ├── LS IssueQ Full              1.14    --
  │       └── FSU IssueQ Full             0.00    --
  └── Memory Bound                        10.26    --
──────────────────────────────────────────────────────────────────
3009 milliseconds time elapsed

Note: To view the hotspot data. You can run devkit tuner hotspot -e [Preferred Sampling Event]

The report /home/topdown_cpu.tar is generated successfully.
To view summary report. you can run: devkit report -i /home/topdown_cpu.tar
To view detail report. you can import the report to the WebUI or IDE to view details.
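
The summary figures in the report above can be cross-checked by hand. The Python sketch below assumes the usual definition IPC = Instructions / Cycles and the top-down convention that the four top-level categories together account for roughly 100% of pipeline slots. The report does not state its rounding rule, so two-decimal rounding is also an assumption.

```python
# Values copied from the CPU-based sample report.
cycles = 244_796_223
instructions = 138_949_659

# Instructions per cycle, rounded to two decimals as in the report.
ipc = round(instructions / cycles, 2)
print(ipc)  # 0.57

top_level = {
    "Bad Speculation": 15.31,
    "Frontend Bound": 40.35,
    "Retiring": 14.19,
    "Backend Bound": 30.15,
}
# The four top-level percentages are expected to add up to about 100.
total = sum(top_level.values())
print(f"{total:.2f}")  # 100.00
```

A total that is far from 100 would suggest that a row was misread or that the report was truncated.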

Collection based on process IDs:

devkit tuner top-down -p 12540 -d 3 -o /home/topdown_pid --package

In this command, -p 12540 collects data from the process whose ID is 12540, and -d 3 sets the collection duration to 3 seconds. The -o /home/topdown_pid and --package parameters generate a report package at /home/topdown_pid.tar. Because -L is not specified, data of all dimensions is collected.

Command output:

TOP-DOWN Summary Report-ALL                    Time:2024/08/06 15:48:48
=======================================================================

Top-down metrics of process id '1884856':
Cycles                   1,488,556,148
Instructions             1,480,811,195
IPC                               0.99

──────────────────────────────────────────────────────────────────
  Top-down Metrics                         Bound(%)    Preferred Sampling Event
──────────────────────────────────────────────────────────────────
  Bad Speculation                            55.99    --
  ├── Branch Mispredicts                  55.86    br_mis_pred
  │   ├── Indirect Branch                 0.00    --
  │   ├── Push Branch                     0.00    --
  │   ├── Pop Branch                      0.00    --
  │   └── Other Branch                   55.86    --
  └── Machine Clears                       0.12    --
      ├── Nuke Flush                       0.02    --
      └── Other Flush                      0.09    --

  Frontend Bound                             12.48    fetch_bubble
  ├── Fetch Latency Bound                  9.42    --
  │   ├── ITLB Miss                       0.02    --
  │   │   ├── L1 Tlb                     0.02    --
  │   │   └── L2 Tlb                     0.00    l2i_tlb_refill
  │   ├── ICache Miss                     0.43    --
  │   │   ├── L1 Cache                   0.11    --
  │   │   └── L2 Cache                   0.32    l2i_cache_refill
  │   ├── Branch Mispredict Flush         8.91    br_mis_pred
  │   ├── OoO Flush                       0.01    --
  │   └── Static Predictor Flush          0.05    --
  └── Fetch Bandwidth Bound                3.05    --

  Retiring                                   24.86    inst_retired

  Backend Bound                               6.66    --
  ├── Resource Bound                       0.07    --
  │   ├── Sync Stall                      0.00    --
  │   ├── Reorder Buffer Stall            0.00    --
  │   ├── Physical Tag Stall              0.07    --
  │   ├── SaveOp Queue Stall              0.00    --
  │   ├── PC Buffer Stall                 0.00    --
  │   └── Other Stall                     0.00    --
  ├── Core Bound                           4.80    --
  │   ├── Divider Stall                   0.00    --
  │   ├── FSU Stall                       0.00    --
  │   └── Exe Ports Util                  4.79    --
  │       ├── ALU BRU IssueQ Full         0.03    --
  │       ├── LS IssueQ Full              0.17    --
  │       └── FSU IssueQ Full             0.00    --
  └── Memory Bound                         1.77    --
      ├── L1 Bound                         1.73    --
      ├── L2 Bound                         0.03    --
      ├── L3 or DRAM Bound                 0.01    cache-misses
      └── Store Bound                      0.00    --
──────────────────────────────────────────────────────────────────
3000 milliseconds time elapsed

Note: To view the hotspot data. You can run devkit tuner hotspot -e [Preferred Sampling Event]

The report /home/topdown_pid.tar is generated successfully.
To view summary report. you can run: devkit report -i /home/topdown_pid.tar
To view detail report. you can import the report to the WebUI or IDE to view details.
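
One way to read a full report such as the one above is to follow the largest percentage at each level of the tree. The Python sketch below is a hypothetical post-processing helper, not a DevKit feature. It assumes the tree is indented by four columns per level as in the sample, and it skips Retiring on the assumption that this category represents useful work rather than a bottleneck. The embedded text is an abridged copy of the sample report.

```python
import re

REPORT = """\
  Bad Speculation                            55.99    --
  ├── Branch Mispredicts                  55.86    br_mis_pred
  │   └── Other Branch                   55.86    --
  └── Machine Clears                       0.12    --
  Frontend Bound                             12.48    fetch_bubble
  ├── Fetch Latency Bound                  9.42    --
  └── Fetch Bandwidth Bound                3.05    --
  Retiring                                   24.86    inst_retired
  Backend Bound                               6.66    --
  ├── Core Bound                           4.80    --
  └── Memory Bound                         1.77    --
"""

ROW = re.compile(
    r"^(?P<prefix>[ │]*(?:[├└]── )?)(?P<name>\S.*?)\s+"
    r"(?P<pct>\d+\.\d+)\s+(?P<event>\S+)\s*$"
)


def parse_rows(text):
    """Return (depth, name, percent) tuples; depth 0 is a top-level metric."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            depth = (len(m["prefix"]) - 2) // 4
            rows.append((depth, m["name"], float(m["pct"])))
    return rows


def hottest_path(rows):
    """Follow the largest child at every level, starting from the top."""
    path, depth, start, end = [], 0, 0, len(rows)
    while True:
        candidates = [i for i in range(start, end)
                      if rows[i][0] == depth and rows[i][1] != "Retiring"]
        if not candidates:
            return path
        best = max(candidates, key=lambda i: rows[i][2])
        path.append((rows[best][1], rows[best][2]))
        # The subtree of `best` runs until the next row that is not deeper.
        start = end = best + 1
        while end < len(rows) and rows[end][0] > depth:
            end += 1
        depth += 1


print(" -> ".join(name for name, _ in hottest_path(parse_rows(REPORT))))
# Bad Speculation -> Branch Mispredicts -> Other Branch
```

For the sample data, the path ends at Other Branch, which matches the 55.86% shown under Branch Mispredicts.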

The Preferred Sampling Event column lists the key event associated with a top-down metric. To analyze and tune a metric with a high percentage, run devkit tuner hotspot -e [Preferred Sampling Event] with the event listed for that metric.
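
As a rough illustration of this workflow, the Python sketch below extracts the events from a few rows of the sample report and prints one hotspot command per event, highest percentage first. The helper name hotspot_commands and the min_pct cutoff are hypothetical. Only the hotspot -e form quoted in the report note is used, and no other hotspot options are assumed.

```python
import re

# A few rows copied from the process-based sample report.
REPORT_ROWS = """\
  Bad Speculation                            55.99    --
  ├── Branch Mispredicts                  55.86    br_mis_pred
  Frontend Bound                             12.48    fetch_bubble
  │   │   └── L2 Tlb                     0.00    l2i_tlb_refill
  │   │   └── L2 Cache                   0.32    l2i_cache_refill
  │   ├── Branch Mispredict Flush         8.91    br_mis_pred
  Retiring                                   24.86    inst_retired
      ├── L3 or DRAM Bound                 0.01    cache-misses
"""

ROW = re.compile(
    r"^[\s│├└─]*(?P<name>\S.*?)\s+(?P<pct>\d+\.\d+)\s+(?P<event>\S+)\s*$"
)


def hotspot_commands(report, min_pct=1.0):
    """Return hotspot commands for events whose metric is at least min_pct."""
    best = {}
    for line in report.splitlines():
        m = ROW.match(line)
        if m and m["event"] != "--":
            # Keep the highest percentage seen for each event.
            pct = float(m["pct"])
            best[m["event"]] = max(pct, best.get(m["event"], 0.0))
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return [f"devkit tuner hotspot -e {event}"
            for event, pct in ranked if pct >= min_pct]


for command in hotspot_commands(REPORT_ROWS):
    print(command)
```

With the sample rows and the assumed 1% cutoff, the commands are printed for br_mis_pred, inst_retired, and fetch_bubble, in that order.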

Collection based on applications:

devkit tuner top-down -d 10 -o /home/topdown_app -L 2 --package /opt/testdemo/topdown_suggest

In this command, /opt/testdemo/topdown_suggest is the application to collect data from, and -d 10 sets the collection duration to 10 seconds. The -o /home/topdown_app and --package parameters generate a report package at /home/topdown_app.tar. The -L 2 parameter collects the Back-End Bound->Core Bound data.

Command output:

TOP-DOWN Summary Report-ALL                    Time:2024/12/25 17:45:45
=======================================================================
Top-down metrics of /opt/testdemo/topdown_suggest:
Cycles              25,998,441,500
Instructions        27,700,695,734
IPC                 1.07
──────────────────────────────────────────────────────────────────
  Top-down Metrics                         Bound(%)    Preferred Sampling Event
──────────────────────────────────────────────────────────────────
  Bad Speculation                             0.01    --
  Frontend Bound                             25.94    fetch_bubble
  Retiring                                   26.63    inst_retired
  Backend Bound                              47.42    --
  ├── Resource Bound                          7.74    --
  ├── Core Bound                             31.47    --
  │   ├── Divider Stall                       0.00    --
  │   ├── FSU Stall                           0.00    --
  │   └── Exe Ports Util                     31.46    --
  │       ├── ALU BRU IssueQ Full            15.67    --
  │       ├── LS IssueQ Full                  3.11    --
  │       └── FSU IssueQ Full                 0.00    --
  └── Memory Bound                            8.20    --
──────────────────────────────────────────────────────────────────
10000 milliseconds time elapsed
Note: To view the hotspot data. You can run devkit tuner hotspot -e [Preferred Sampling Event]
Optimization Suggestions
    1. The percentage of Frontend Bound is high.(Threshold: 20.00%)
       Take the following optimization measures for C/C++ applications compiled using the BiSheng compiler. For other compilers, you can refer to the optimization
       suggestions. Verify the optimization suggestions in your specific application scenario.
       (1) Set the Inline parameter: -mllvm -inline-threshold=1550 (1550 is an optimal empirical value). You are advised to enable LTO in advance.
       (2) Adjust the alignment of functions, that of basic blocks, and that of basic blocks without jumping: -mllvm -align-all-functions=2^n -mllvm
       -align-all-blocks=2^n -mllvm -align-all-nofallthru-blocks=2^n, where the value of 2^n is 32 or 64 and can be changed to 16 or 128 if needed.
       (3) Enable PGO: -mllvm -enable-split-machine-functions. After PGO is enabled, the compiler splits functions based on the popularity of basic blocks and adjusts
       the code block layout to optimize the program performance.
       (4) Enable LTO. There are two types of LTO: full and thin, which correspond to -flto=full and -flto=thin. Full LTO delivers superior performance but requires a
       longer compilation time. Thin LTO delivers inferior performance but needs a shorter compilation time. To enable LTO for the compilation, add a link time option
       to the optimization options of 1) to 3). For example, for -mllvm -enable-split-machine-functions, prefix it with -fuse-ld=lld -Wl, that is, -fuse-ld=lld
       -Wl,-mllvm,-enable-split-machine-functions.
       The DevKit provides the autoFDO capability to automatically adjust compilation based on the feedback result.
       (1) You can use autofdo for the tuning: devkit advisor kfdo -h
    2. The percentage of Backend Bound is high.(Threshold: 20.00%)
       Take the following optimization measures for C/C++ applications compiled using the BiSheng compiler. For other compilers, you can refer to the optimization
       suggestions. Verify the optimization suggestions in your specific application scenario.
       (1) Use the jemalloc library. Associate the libjemalloc.so soft link in the lib directory of the BiSheng compiler with the jemalloc dynamic library entity whose
       size is the same as the system page table size in this directory, and add the -ljemalloc parameter for the compilation.
       (2) Set Wrap-memset/memcpy: Wl,-wrap=memset/memcpy -lstringlib. The BiSheng compiler provides memset/memcpy implementation in the libstring library, which is
       more adaptable to the AArch64 architecture. When the glibc version is earlier and the function proportion is high, the performance is significantly improved.
       (3) Set prefetch to save the data to be accessed to the cache, so as to reduce the value of d-cache miss. The hardware has its own prefetch mechanism. The
       compiler supports the software prefetch function. When tsv110 is enabled, the BiSheng compiler automatically enables software prefetch. You can adjust the
       prefetch density by using the three parameters: -mllvm -prefetch-loop-depth=x -mllvm -min-prefetch-stride=y -mllvm -prefetch-distance=z, where for example, x=3,
       y=9, z=940.
       (4) Add the -fstack-arrays parameter to place all arrays onto the stack. The parameter takes effect only on Fortran.
       (5) Try enabling huge pages.
The report /home/topdown_app.tar is generated successfully.
To view summary report. you can run: devkit report -i /home/topdown_app.tar
To view detail report. you can import the report to the WebUI or IDE to view details.

When the Frontend Bound metric is higher than 20%, the Kunpeng DevKit provides the AutoFDO capability to automatically adjust the compilation based on the feedback result. You can run devkit advisor kfdo -h to view the details.
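
The threshold logic visible in the sample output can be mimicked in a few lines. The Python sketch below is an assumption-laden illustration: it uses only the two 20.00% thresholds that appear in the Optimization Suggestions text, treats "higher than" as a strict comparison, and makes no claim about thresholds for other metrics.

```python
# Thresholds quoted in the Optimization Suggestions of the sample report.
THRESHOLDS = {"Frontend Bound": 20.00, "Backend Bound": 20.00}


def metrics_over_threshold(metrics):
    """Return the metric names whose percentage exceeds a known threshold."""
    return [name for name, pct in metrics.items()
            if name in THRESHOLDS and pct > THRESHOLDS[name]]


# Top-level values from the application-based sample report.
app_report = {
    "Bad Speculation": 0.01,
    "Frontend Bound": 25.94,
    "Retiring": 26.63,
    "Backend Bound": 47.42,
}
print(metrics_over_threshold(app_report))
# ['Frontend Bound', 'Backend Bound']
```

Both flagged metrics correspond to the two suggestion groups printed in the sample output.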

The command output is an overview of the microarchitecture analysis task. To view the results graphically, use the --package parameter to generate a TAR package and import it to the WebUI. For details, see the description of importing tasks in Task Management.
