Microarchitecture Analysis

Microarchitecture analysis uses Arm performance monitoring unit (PMU) events to show how instructions run on the CPU pipeline. You can then modify your application accordingly to make full use of your hardware resources.

Command Function

Analyzes the running status of instructions on the CPU pipeline based on Arm PMU events, helping you quickly locate the performance bottlenecks of the current application on the CPU.

Syntax

devkit tuner top-down [-h] [-c {n | n,m | n-m}] [-d <sec>] [-D <sec>] [-l {0, 1, 2, 3}] [-L {0, 1, 2, 3, 4, 5, 6}] [-i <sec>] [-p {PID | PID1,PID2 | ALL}] [-r {user, kernel, all}] [-o] [--package] [workload workload...]

Use devkit tuner top-down [workload workload...] to collect data of a specified application. Replace [workload workload...] in the command with the application path and the application parameters. If both -c/--cpu and -p/--pid are specified, -p takes precedence and only the data of the specified processes is collected.
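
As a rough illustration of where the application path and its parameters go, the following Python sketch assembles such a command line. The helper name build_top_down_argv and its keyword arguments are hypothetical; only the flags listed in the syntax above are used.

```python
import shlex


def build_top_down_argv(workload, workload_args=(), duration=None,
                        profile_level=None, output=None):
    """Assemble a `devkit tuner top-down` command line for an application.

    Hypothetical helper for illustration; it only arranges documented flags.
    """
    argv = ["devkit", "tuner", "top-down"]
    if duration is not None:
        argv += ["-d", str(duration)]
    if output is not None:
        argv += ["-o", output]
    if profile_level is not None:
        argv += ["-L", str(profile_level)]
    if output is not None:
        # -o must be used together with --package.
        argv.append("--package")
    # The application path and its parameters come last.
    argv.append(workload)
    argv.extend(workload_args)
    return argv


argv = build_top_down_argv("/opt/testdemo/topdown_suggest",
                           duration=10, profile_level=2,
                           output="/home/topdown_app")
print(shlex.join(argv))
```

The resulting list can be passed to subprocess.run if you want to launch the collection from a script.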

Parameter Description

**Table 1** Parameter description
Parameter	Option	Description
-h/--help	-	Obtains help information. This parameter is optional.
-c/--cpu	-	CPU cores to collect data from, specified as a single core (0), a comma-separated list (0,1,2), or a range (0-2). This parameter is optional.
-d/--duration	-	Collection duration, in seconds. The minimum value is 1. By default, collection continues until you stop it. Press Ctrl+\ to cancel the task, or press Ctrl+C to stop the collection and start the analysis. This parameter is optional.
-D/--delay	-	Collection delay, in seconds. The default value is 0. The value must be less than the collection duration. This parameter is optional.
-l/--log-level	0/1/2/3	Log level. The default value is 1. This parameter is optional. 0: DEBUG. 1: INFO. 2: WARNING. 3: ERROR.
-L/--profile-level	0/1/2/3/4/5/6	Analysis metric. The default value is 0. This parameter is optional. 0: Collects data of all dimensions and generates a result. 1: Collects Back-End Bound, Bad Speculation, Front-End Bound, and Retiring. 2: Collects Back-End Bound->Core Bound. The back end is the part of the processor that dispatches and executes micro-ops (uOps) out of order and returns the results. Core Bound is a subclass of Back-End Bound that reflects the proportion of performance bottlenecks caused by insufficient CPU execution unit resources. 3: Collects Back-End Bound->Memory Bound. Memory Bound is a subclass of Back-End Bound that reflects pipeline stalls caused by waiting for data reads and writes. 4: Collects Back-End Bound->Resource Bound (Kunpeng 920). Resource Bound is a subclass of Back-End Bound that reflects pipeline stalls that occur, because of insufficient resources, when uOps are dispatched to the out-of-order execution scheduler. 5: Collects Bad Speculation, which reflects pipeline resources wasted on incorrect instruction speculation. 6: Collects Front-End Bound. The front end is the part of the processor that fetches instructions and decodes them into uOps for the back-end pipeline to execute. This metric reflects the proportion of processor front-end resources that are under-utilized.
-i/--interval	-	Collection interval, in seconds. The default value is 1. If a collection duration is set, the interval must be less than or equal to that duration. This parameter is optional.
-p/--pid	-	ID of the process to collect data from. Separate multiple PIDs with commas (,). The default value is ALL. This parameter is optional. If both -p and -c are specified, only the processes with the specified PIDs are collected.
-r/--collection-range	user/kernel/all	Collection range. The default value is all. This parameter is optional. When -p/--pid is set to ALL, you can select user or kernel to restrict collection to user-mode or kernel-mode processes. user: collects user-mode performance data. kernel: collects kernel-mode performance data. all: collects both user-mode and kernel-mode performance data.
-o/--output	-	Name and output path of the report package (no file name extension required). If you enter only a name, the report package is generated in the current directory. This option must be used together with --package. This parameter is optional.
--package	-	Generates a report data package. If you do not set the package name or path, the top-down-Timestamp.tar package is generated in the current directory by default. This parameter is optional.
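
The value constraints in Table 1 can be checked before a collection is started. The following Python sketch encodes only the rules stated in the table; the function name check_top_down_options is hypothetical and is not part of the tool.

```python
def check_top_down_options(duration=None, delay=0, interval=1, profile_level=0):
    """Return a list of problems with the given option values (empty if none).

    duration=None models the default, where collection runs until stopped.
    """
    problems = []
    if duration is not None:
        if duration < 1:
            problems.append("-d/--duration must be at least 1 second")
        if delay >= duration:
            problems.append("-D/--delay must be less than the collection duration")
        if interval > duration:
            problems.append("-i/--interval must not exceed the collection duration")
    if profile_level not in range(7):
        problems.append("-L/--profile-level must be an integer from 0 to 6")
    return problems


# A 3-second collection with a 5-second interval violates the -i rule.
for problem in check_top_down_options(duration=3, interval=5):
    print(problem)
```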

Example

Collect CPU data.

devkit tuner top-down -c 0-127 -d 3 -o /home/topdown_cpu -L 2 --package

In this command, -c 0-127 collects data on CPU cores 0 to 127 and -d 3 sets the collection duration to 3 seconds. -o /home/topdown_cpu and --package generate a report data package named topdown_cpu in the specified path. -L 2 collects the Back-End Bound->Core Bound data.

Command output:

TOP-DOWN Summary Report-ALL                    Time:2024/08/06 15:41:27
=======================================================================

Top-down metrics of the system:
Cycles                     244,796,223
Instructions               138,949,659
IPC                               0.57

──────────────────────────────────────────────────────────────────
  Top-down Metrics                         Bound(%)    Preferred Sampling Event
──────────────────────────────────────────────────────────────────
  Bad Speculation                            15.31    --

  Frontend Bound                             40.35    fetch_bubble

  Retiring                                   14.19    inst_retired

  Backend Bound                              30.15    --
  ├── Resource Bound                       4.43    --
  ├── Core Bound                          15.42    --
  │   ├── Divider Stall                   0.00    --
  │   ├── FSU Stall                       0.00    --
  │   └── Exe Ports Util                 15.41    --
  │       ├── ALU BRU IssueQ Full         0.61    --
  │       ├── LS IssueQ Full              1.14    --
  │       └── FSU IssueQ Full             0.00    --
  └── Memory Bound                        10.26    --
──────────────────────────────────────────────────────────────────
3009 milliseconds time elapsed

Note: To view the hotspot data. You can run devkit tuner hotspot -e [Preferred Sampling Event]

The report /home/topdown_cpu.tar is generated successfully.
To view summary report. you can run: devkit report -i /home/topdown_cpu.tar
To view detail report. you can import the report to the WebUI or IDE to view details.
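
The summary figures in this output can be cross-checked by hand. The Python sketch below, using only the numbers printed above, recomputes IPC as instructions divided by cycles and adds up the four top-level categories, which in this sample account for the whole pipeline.

```python
# Values copied from the sample report above.
cycles = 244_796_223
instructions = 138_949_659

# IPC is the number of instructions retired per cycle.
ipc = instructions / cycles
print(f"IPC = {ipc:.2f}")

top_level = {
    "Bad Speculation": 15.31,
    "Frontend Bound": 40.35,
    "Retiring": 14.19,
    "Backend Bound": 30.15,
}
total = sum(top_level.values())
print(f"Top-level total = {total:.2f}%")

# The largest top-level category is the first place to look for a bottleneck.
largest = max(top_level, key=top_level.get)
print(f"Largest category: {largest}")
```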

Collect process data.

devkit tuner top-down -p 12540 -d 3 -o /home/topdown_pid --package

In this command, -p 12540 collects data of the process whose ID is 12540 and -d 3 sets the collection duration to 3 seconds. -o /home/topdown_pid and --package generate a report data package named topdown_pid in the specified path. Because -L is not specified, data of all dimensions is collected.

Command output:

TOP-DOWN Summary Report-ALL                    Time:2024/08/06 15:48:48
=======================================================================

Top-down metrics of process id '1884856':
Cycles                   1,488,556,148
Instructions             1,480,811,195
IPC                               0.99

──────────────────────────────────────────────────────────────────
  Top-down Metrics                         Bound(%)    Preferred Sampling Event
──────────────────────────────────────────────────────────────────
  Bad Speculation                            55.99    --
  ├── Branch Mispredicts                  55.86    br_mis_pred
  │   ├── Indirect Branch                 0.00    --
  │   ├── Push Branch                     0.00    --
  │   ├── Pop Branch                      0.00    --
  │   └── Other Branch                   55.86    --
  └── Machine Clears                       0.12    --
      ├── Nuke Flush                       0.02    --
      └── Other Flush                      0.09    --

  Frontend Bound                             12.48    fetch_bubble
  ├── Fetch Latency Bound                  9.42    --
  │   ├── ITLB Miss                       0.02    --
  │   │   ├── L1 Tlb                     0.02    --
  │   │   └── L2 Tlb                     0.00    l2i_tlb_refill
  │   ├── ICache Miss                     0.43    --
  │   │   ├── L1 Cache                   0.11    --
  │   │   └── L2 Cache                   0.32    l2i_cache_refill
  │   ├── Branch Mispredict Flush         8.91    br_mis_pred
  │   ├── OoO Flush                       0.01    --
  │   └── Static Predictor Flush          0.05    --
  └── Fetch Bandwidth Bound                3.05    --

  Retiring                                   24.86    inst_retired

  Backend Bound                               6.66    --
  ├── Resource Bound                       0.07    --
  │   ├── Sync Stall                      0.00    --
  │   ├── Reorder Buffer Stall            0.00    --
  │   ├── Physical Tag Stall              0.07    --
  │   ├── SaveOp Queue Stall              0.00    --
  │   ├── PC Buffer Stall                 0.00    --
  │   └── Other Stall                     0.00    --
  ├── Core Bound                           4.80    --
  │   ├── Divider Stall                   0.00    --
  │   ├── FSU Stall                       0.00    --
  │   └── Exe Ports Util                  4.79    --
  │       ├── ALU BRU IssueQ Full         0.03    --
  │       ├── LS IssueQ Full              0.17    --
  │       └── FSU IssueQ Full             0.00    --
  └── Memory Bound                         1.77    --
      ├── L1 Bound                         1.73    --
      ├── L2 Bound                         0.03    --
      ├── L3 or DRAM Bound                 0.01    cache-misses
      └── Store Bound                      0.00    --
──────────────────────────────────────────────────────────────────
3000 milliseconds time elapsed

Note: To view the hotspot data. You can run devkit tuner hotspot -e [Preferred Sampling Event]

The report /home/topdown_pid.tar is generated successfully.
To view summary report. you can run: devkit report -i /home/topdown_pid.tar
To view detail report. you can import the report to the WebUI or IDE to view details.

The Preferred Sampling Event column lists the key events that affect each top-down bound metric. To reduce a bound, analyze and tune the code that triggers its key event by running devkit tuner hotspot -e [Preferred Sampling Event].
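
To illustrate how the Preferred Sampling Event column can feed the hotspot command, the Python sketch below extracts the events from a few report rows copied from the output above. It assumes the text layout shown in this example (metric name, percentage, event); the function name preferred_events and the min_pct filter are hypothetical.

```python
import re

# A few rows copied from the sample report above.
SAMPLE = """\
  Bad Speculation                            55.99    --
  ├── Branch Mispredicts                  55.86    br_mis_pred
  │   │   └── L2 Tlb                     0.00    l2i_tlb_refill
  Frontend Bound                             12.48    fetch_bubble
  │   ├── Branch Mispredict Flush         8.91    br_mis_pred
  Retiring                                   24.86    inst_retired
"""

# Optional tree-drawing prefix, then metric name, percentage, and event.
ROW = re.compile(
    r"^[\s│├└─]*(?P<name>.+?)\s+(?P<pct>\d+\.\d+)\s+(?P<event>\S+)\s*$"
)


def preferred_events(report_text, min_pct=0.0):
    """Return the distinct events of rows whose percentage is >= min_pct."""
    events = []
    for line in report_text.splitlines():
        m = ROW.match(line)
        if not m or m["event"] == "--":
            continue  # not a metric row, or no preferred event listed
        if float(m["pct"]) >= min_pct and m["event"] not in events:
            events.append(m["event"])
    return events


for event in preferred_events(SAMPLE, min_pct=5.0):
    print(f"devkit tuner hotspot -e {event}")
```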

Collect application data.

devkit tuner top-down -d 10 -o /home/topdown_app -L 2 --package /opt/testdemo/topdown_suggest

In this command, -d 10 sets the collection duration to 10 seconds and /opt/testdemo/topdown_suggest is the application to collect. -o /home/topdown_app and --package generate a report data package named topdown_app in the specified path. -L 2 collects the Back-End Bound->Core Bound data.

Command output:

TOP-DOWN Summary Report-ALL                    Time:2024/12/25 17:45:45
=======================================================================
Top-down metrics of /opt/testdemo/topdown_suggest:
Cycles              25,998,441,500
Instructions        27,700,695,734
IPC                 1.07
──────────────────────────────────────────────────────────────────
  Top-down Metrics                         Bound(%)    Preferred Sampling Event
──────────────────────────────────────────────────────────────────
  Bad Speculation                             0.01    --
  Frontend Bound                             25.94    fetch_bubble
  Retiring                                   26.63    inst_retired
  Backend Bound                              47.42    --
  ├── Resource Bound                          7.74    --
  ├── Core Bound                             31.47    --
  │   ├── Divider Stall                       0.00    --
  │   ├── FSU Stall                           0.00    --
  │   └── Exe Ports Util                     31.46    --
  │       ├── ALU BRU IssueQ Full            15.67    --
  │       ├── LS IssueQ Full                  3.11    --
  │       └── FSU IssueQ Full                 0.00    --
  └── Memory Bound                            8.20    --
──────────────────────────────────────────────────────────────────
10000 milliseconds time elapsed
Note: To view the hotspot data. You can run devkit tuner hotspot -e [Preferred Sampling Event]
Optimization Suggestions
    1. The percentage of Frontend Bound is high.(Threshold: 20.00%)
       Take the following optimization measures for C/C++ applications compiled using the BiSheng compiler. For other compilers, you can refer to the optimization
       suggestions. Verify the optimization suggestions in your specific application scenario.
       (1) Set the Inline parameter: -mllvm -inline-threshold=1550 (1550 is an optimal empirical value). You are advised to enable LTO in advance.
       (2) Adjust the alignment of functions, that of basic blocks, and that of basic blocks without jumping: -mllvm -align-all-functions=2^n -mllvm
       -align-all-blocks=2^n -mllvm -align-all-nofallthru-blocks=2^n, where the value of 2^n is 32 or 64 and can be changed to 16 or 128 if needed.
       (3) Enable PGO: -mllvm -enable-split-machine-functions. After PGO is enabled, the compiler splits functions based on the popularity of basic blocks and adjusts
       the code block layout to optimize the program performance.
       (4) Enable LTO. There are two types of LTO: full and thin, which correspond to -flto=full and -flto=thin. Full LTO delivers superior performance but requires a
       longer compilation time. Thin LTO delivers inferior performance but needs a shorter compilation time. To enable LTO for the compilation, add a link time option
       to the optimization options of 1) to 3). For example, for -mllvm -enable-split-machine-functions, prefix it with -fuse-ld=lld -Wl, that is, -fuse-ld=lld
       -Wl,-mllvm,-enable-split-machine-functions.
    2. The percentage of Backend Bound is high.(Threshold: 20.00%)
       Take the following optimization measures for C/C++ applications compiled using the BiSheng compiler. For other compilers, you can refer to the optimization
       suggestions. Verify the optimization suggestions in your specific application scenario.
       (1) Use the jemalloc library. Associate the libjemalloc.so soft link in the lib directory of the BiSheng compiler with the jemalloc dynamic library entity whose
       size is the same as the system page table size in this directory, and add the -ljemalloc parameter for the compilation.
       (2) Set Wrap-memset/memcpy: Wl,-wrap=memset/memcpy -lstringlib. The BiSheng compiler provides memset/memcpy implementation in the libstring library, which is
       more adaptable to the AArch64 architecture. When the glibc version is earlier and the function proportion is high, the performance is significantly improved.
       (3) Set prefetch to save the data to be accessed to the cache, so as to reduce the value of d-cache miss. The hardware has its own prefetch mechanism. The
       compiler supports the software prefetch function. When tsv110 is enabled, the BiSheng compiler automatically enables software prefetch. You can adjust the
       prefetch density by using the three parameters: -mllvm -prefetch-loop-depth=x -mllvm -min-prefetch-stride=y -mllvm -prefetch-distance=z, where for example, x=3,
       y=9, z=940.
       (4) Add the -fstack-arrays parameter to place all arrays onto the stack. The parameter takes effect only on Fortran.
       (5) Try enabling huge pages.
The report /home/topdown_app.tar is generated successfully.
To view summary report. you can run: devkit report -i /home/topdown_app.tar
To view detail report. you can import the report to the WebUI or IDE to view details.
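
The Optimization Suggestions in this output appear for the categories whose percentage exceeds the printed threshold of 20.00%. The Python sketch below reproduces that comparison for the values above. It assumes the threshold is checked only for Frontend Bound and Backend Bound, the two categories flagged in this sample; how the tool treats the other categories is not shown here.

```python
THRESHOLD = 20.00  # threshold printed in the Optimization Suggestions output

# Top-level values copied from the sample report above.
metrics = {
    "Bad Speculation": 0.01,
    "Frontend Bound": 25.94,
    "Retiring": 26.63,
    "Backend Bound": 47.42,
}

# Assumption: only these two categories are compared with the threshold.
# Retiring also exceeds 20% in this sample but receives no suggestion.
CHECKED = ("Frontend Bound", "Backend Bound")

flagged = [name for name in CHECKED if metrics[name] > THRESHOLD]
for name in flagged:
    print(f"The percentage of {name} is high ({metrics[name]:.2f}% > {THRESHOLD:.2f}%)")
```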

View the generated report.

devkit report -i /home/topdown_app.tar

Command output:

TOP-DOWN Summary Report-ALL                    Time:2024/12/25 17:45:45
=======================================================================
Top-down metrics of /opt/testdemo/topdown_suggest:
Cycles              25,998,441,500
Instructions        27,700,695,734
IPC                 1.07
──────────────────────────────────────────────────────────────────
  Top-down Metrics                         Bound(%)    Preferred Sampling Event
──────────────────────────────────────────────────────────────────
  Bad Speculation                             0.01    --
  Frontend Bound                             25.94    fetch_bubble
  Retiring                                   26.63    inst_retired
  Backend Bound                              47.42    --
  ├── Resource Bound                          7.74    --
  ├── Core Bound                             31.47    --
  │   ├── Divider Stall                       0.00    --
  │   ├── FSU Stall                           0.00    --
  │   └── Exe Ports Util                     31.46    --
  │       ├── ALU BRU IssueQ Full            15.67    --
  │       ├── LS IssueQ Full                  3.11    --
  │       └── FSU IssueQ Full                 0.00    --
  └── Memory Bound                            8.20    --
──────────────────────────────────────────────────────────────────
10000 milliseconds time elapsed
Note: To view the hotspot data. You can run devkit tuner hotspot -e [Preferred Sampling Event]
Optimization Suggestions
    1. The percentage of Frontend Bound is high.(Threshold: 20.00%)
       Take the following optimization measures for C/C++ applications compiled using the BiSheng compiler. For other compilers, you can refer to the optimization
       suggestions. Verify the optimization suggestions in your specific application scenario.
       (1) Set the Inline parameter: -mllvm -inline-threshold=1550 (1550 is an optimal empirical value). You are advised to enable LTO in advance.
       (2) Adjust the alignment of functions, that of basic blocks, and that of basic blocks without jumping: -mllvm -align-all-functions=2^n -mllvm
       -align-all-blocks=2^n -mllvm -align-all-nofallthru-blocks=2^n, where the value of 2^n is 32 or 64 and can be changed to 16 or 128 if needed.
       (3) Enable PGO: -mllvm -enable-split-machine-functions. After PGO is enabled, the compiler splits functions based on the popularity of basic blocks and adjusts
       the code block layout to optimize the program performance.
       (4) Enable LTO. There are two types of LTO: full and thin, which correspond to -flto=full and -flto=thin. Full LTO delivers superior performance but requires a
       longer compilation time. Thin LTO delivers inferior performance but needs a shorter compilation time. To enable LTO for the compilation, add a link time option
       to the optimization options of 1) to 3). For example, for -mllvm -enable-split-machine-functions, prefix it with -fuse-ld=lld -Wl, that is, -fuse-ld=lld
       -Wl,-mllvm,-enable-split-machine-functions.
    2. The percentage of Backend Bound is high.(Threshold: 20.00%)
       Take the following optimization measures for C/C++ applications compiled using the BiSheng compiler. For other compilers, you can refer to the optimization
       suggestions. Verify the optimization suggestions in your specific application scenario.
       (1) Use the jemalloc library. Associate the libjemalloc.so soft link in the lib directory of the BiSheng compiler with the jemalloc dynamic library entity whose
       size is the same as the system page table size in this directory, and add the -ljemalloc parameter for the compilation.
       (2) Set Wrap-memset/memcpy: Wl,-wrap=memset/memcpy -lstringlib. The BiSheng compiler provides memset/memcpy implementation in the libstring library, which is
       more adaptable to the AArch64 architecture. When the glibc version is earlier and the function proportion is high, the performance is significantly improved.
       (3) Set prefetch to save the data to be accessed to the cache, so as to reduce the value of d-cache miss. The hardware has its own prefetch mechanism. The
       compiler supports the software prefetch function. When tsv110 is enabled, the BiSheng compiler automatically enables software prefetch. You can adjust the
       prefetch density by using the three parameters: -mllvm -prefetch-loop-depth=x -mllvm -min-prefetch-stride=y -mllvm -prefetch-distance=z, where for example, x=3,
       y=9, z=940.
       (4) Add the -fstack-arrays parameter to place all arrays onto the stack. The parameter takes effect only on Fortran.
       (5) Try enabling huge pages.
