Microarchitecture Analysis
Command Function
Collects Arm performance monitor unit (PMU) events to show how instructions run on the CPU pipeline, helping you quickly locate the performance bottlenecks of the current application on the CPU. You can then modify your application to make full use of hardware resources.
Syntax
    devkit tuner top-down [-h] [-c {n | n,m | n-m}] [-d <sec>] [-D <sec>] [-l {0, 1, 2, 3}] [-L {0, 1, 2, 3, 4, 5, 6}] [-i <sec>] [-p {PID | PID1,PID2 | ALL}] [-r {user, kernel, all}] [-o] [--package] [workload workload...]
To collect data for a specific application, run devkit tuner top-down [workload workload...], replacing [workload workload...] with the application path and the application parameters. If both -c/--cpu and -p/--pid are specified, -p takes precedence and data is collected for the specified processes.
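The three -c value forms in the syntax (n, n,m and n-m) can be illustrated with a short Python sketch. The helper name parse_cpu_spec is hypothetical and not part of the DevKit. The sketch also accepts mixed values such as 0,2-4, which is an assumption; the tool itself may reject them.

```python
def parse_cpu_spec(spec: str) -> list[int]:
    """Expand a -c/--cpu value such as "0", "0,1,2" or "0-2" into core IDs.

    Hypothetical helper for illustration only; it is not a DevKit API.
    """
    cores: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            # Range form n-m: both ends are inclusive.
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            cores.update(range(start, end + 1))
        else:
            # Single core form n.
            cores.add(int(part))
    return sorted(cores)


print(parse_cpu_spec("0"))
print(parse_cpu_spec("0,1,2"))
print(parse_cpu_spec("0-2"))
```

Under these assumptions, 0-127 expands to 128 core IDs, which matches the range used in the CPU-based example.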
Parameter Description
| Parameter | Option | Description |
|---|---|---|
| -h/--help | - | Obtains help information. |
| -c/--cpu | - | CPU cores to collect data from. Specify a single core (for example, 0), a comma-separated list (0,1,2), or a range (0-2). |
| -d/--duration | - | Collection duration, in seconds. The minimum value is 1 second. By default, collection never ends. Press Ctrl+\ to cancel the task, or press Ctrl+C to stop the collection and start analysis. |
| -D/--delay | - | Collection delay, in seconds. The default value is 0. The delay must be less than the collection duration. |
| -i/--interval | - | Collection interval, in seconds. The default value is 1. If the collection duration is set, the interval must be less than or equal to the duration. |
| -l/--log-level | 0/1/2/3 | Log level. The default value is 1. |
| -L/--profile-level | 0/1/2/3/4/5/6 | Analysis metric. The default value is 0. |
| -o/--output | - | Report package name and output path. If you enter only a name, the report package is generated in the current directory. This option must be used together with --package. |
| -r/--collection-range | user/kernel/all | Process collection level. When -p/--pid is set to ALL, you can select user or kernel to collect only user-mode or only kernel-mode processes. The default value is all, which collects both user-mode and kernel-mode performance data. |
| -p/--pid | PID / PID1,PID2 / ALL | IDs of the processes to collect data from. Separate multiple PIDs with commas (,). The default value is ALL. If both -p and -c are specified, the processes with the specified PIDs take precedence. |
| --package | - | Specifies whether to generate a report data package. If you do not set a package name or path, the top-down-timestamp.tar package is generated in the current directory. |
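The timing constraints in the table (minimum duration, delay less than duration, interval not greater than duration) can be checked before launching a collection. The following Python sketch is illustrative only: check_timing is a hypothetical helper, and the tool may word or enforce these rules differently.

```python
def check_timing(duration=None, delay=0, interval=1):
    """Return the constraints from the parameter table that are violated.

    duration=None models the default, where collection never ends.
    Hypothetical helper for illustration; not part of the DevKit.
    """
    problems = []
    if duration is not None:
        if duration < 1:
            problems.append("duration must be at least 1 second")
        if delay >= duration:
            problems.append("delay must be less than the duration")
        if interval > duration:
            problems.append("interval must not exceed the duration")
    return problems


print(check_timing(duration=3))            # no violations
print(check_timing(duration=3, delay=3))   # delay is not less than duration
print(check_timing(duration=1, interval=2))
```

An empty list means the values are consistent with the table. When no duration is given, the sketch skips the duration-related checks because the table only ties them to a set duration.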
Example
- Collection based on CPUs:
    devkit tuner top-down -c 0-127 -d 3 -o /home/topdown_cpu -L 2 --package
In this command, -c 0-127 collects data from CPU cores 0 to 127, and -d 3 sets the collection duration to 3 seconds. -o /home/topdown_cpu and --package generate a report data package named topdown_cpu in the specified path. -L 2 collects the Back-End Bound->Core Bound instruction data.
Command output:
    TOP-DOWN Summary Report-ALL                    Time:2024/08/06 15:41:27
    =======================================================================
    Top-down metrics of the system:
    Cycles          244,796,223
    Instructions    138,949,659
    IPC             0.57
    ──────────────────────────────────────────────────────────────────
    Top-down Metrics                  Bound(%)    Preferred Sampling Event
    ──────────────────────────────────────────────────────────────────
    Bad Speculation                   15.31       --
    Frontend Bound                    40.35       fetch_bubble
    Retiring                          14.19       inst_retired
    Backend Bound                     30.15       --
    ├── Resource Bound                4.43        --
    ├── Core Bound                    15.42       --
    │   ├── Divider Stall             0.00        --
    │   ├── FSU Stall                 0.00        --
    │   └── Exe Ports Util            15.41       --
    │       ├── ALU BRU IssueQ Full   0.61        --
    │       ├── LS IssueQ Full        1.14        --
    │       └── FSU IssueQ Full       0.00        --
    └── Memory Bound                  10.26       --
    ──────────────────────────────────────────────────────────────────
    3009 milliseconds time elapsed
    Note: To view the hotspot data. You can run devkit tuner hotspot -e [Preferred Sampling Event]
    The report /home/topdown_cpu.tar is generated successfully.
    To view summary report. you can run: devkit report -i /home/topdown_cpu.tar
    To view detail report. you can import the report to the WebUI or IDE to view details.
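As a reading aid for the summary above, the following Python sketch recomputes IPC as instructions divided by cycles and adds up the four top-level categories. The numbers are copied from the sample output; the variable names are illustrative, and the sketch does not describe how the tool derives its metrics internally.

```python
# Values copied from the sample report above.
cycles = 244_796_223
instructions = 138_949_659

# IPC is the number of instructions retired per CPU cycle.
ipc = instructions / cycles

top_level = {
    "Bad Speculation": 15.31,
    "Frontend Bound": 40.35,
    "Retiring": 14.19,
    "Backend Bound": 30.15,
}

# In this sample, the four top-level categories account for all pipeline slots.
total = sum(top_level.values())

# The largest category is a reasonable place to start the investigation.
dominant = max(top_level, key=top_level.get)

print(f"IPC = {ipc:.2f}")
print(f"top-level total = {total:.2f}%")
print(f"largest category: {dominant}")
```

For this sample the largest category is Frontend Bound, whose preferred sampling event in the report is fetch_bubble.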
- Collection based on process IDs:
    devkit tuner top-down -p 12540 -d 3 -o /home/topdown_pid --package
In this command, -p 12540 collects data from the process whose ID is 12540, and -d 3 sets the collection duration to 3 seconds. -o /home/topdown_pid and --package generate a report data package named topdown_pid in the specified path. Because -L is not specified, data of all dimensions is collected.
Command output:
    TOP-DOWN Summary Report-ALL                    Time:2024/08/06 15:48:48
    =======================================================================
    Top-down metrics of process id '1884856':
    Cycles          1,488,556,148
    Instructions    1,480,811,195
    IPC             0.99
    ──────────────────────────────────────────────────────────────────
    Top-down Metrics                      Bound(%)    Preferred Sampling Event
    ──────────────────────────────────────────────────────────────────
    Bad Speculation                       55.99       --
    ├── Branch Mispredicts                55.86       br_mis_pred
    │   ├── Indirect Branch               0.00        --
    │   ├── Push Branch                   0.00        --
    │   ├── Pop Branch                    0.00        --
    │   └── Other Branch                  55.86       --
    └── Machine Clears                    0.12        --
        ├── Nuke Flush                    0.02        --
        └── Other Flush                   0.09        --
    Frontend Bound                        12.48       fetch_bubble
    ├── Fetch Latency Bound               9.42        --
    │   ├── ITLB Miss                     0.02        --
    │   │   ├── L1 Tlb                    0.02        --
    │   │   └── L2 Tlb                    0.00        l2i_tlb_refill
    │   ├── ICache Miss                   0.43        --
    │   │   ├── L1 Cache                  0.11        --
    │   │   └── L2 Cache                  0.32        l2i_cache_refill
    │   ├── Branch Mispredict Flush       8.91        br_mis_pred
    │   ├── OoO Flush                     0.01        --
    │   └── Static Predictor Flush        0.05        --
    └── Fetch Bandwidth Bound             3.05        --
    Retiring                              24.86       inst_retired
    Backend Bound                         6.66        --
    ├── Resource Bound                    0.07        --
    │   ├── Sync Stall                    0.00        --
    │   ├── Reorder Buffer Stall          0.00        --
    │   ├── Physical Tag Stall            0.07        --
    │   ├── SaveOp Queue Stall            0.00        --
    │   ├── PC Buffer Stall               0.00        --
    │   └── Other Stall                   0.00        --
    ├── Core Bound                        4.80        --
    │   ├── Divider Stall                 0.00        --
    │   ├── FSU Stall                     0.00        --
    │   └── Exe Ports Util                4.79        --
    │       ├── ALU BRU IssueQ Full       0.03        --
    │       ├── LS IssueQ Full            0.17        --
    │       └── FSU IssueQ Full           0.00        --
    └── Memory Bound                      1.77        --
        ├── L1 Bound                      1.73        --
        ├── L2 Bound                      0.03        --
        ├── L3 or DRAM Bound              0.01        cache-misses
        └── Store Bound                   0.00        --
    ──────────────────────────────────────────────────────────────────
    3000 milliseconds time elapsed
    Note: To view the hotspot data. You can run devkit tuner hotspot -e [Preferred Sampling Event]
    The report /home/topdown_pid.tar is generated successfully.
    To view summary report. you can run: devkit report -i /home/topdown_pid.tar
    To view detail report. you can import the report to the WebUI or IDE to view details.
The Preferred Sampling Event column lists the key events that contribute to each bound metric. You can reduce a bound metric by tuning the code that triggers its key event. To analyze and tune these events, run devkit tuner hotspot -e [Preferred Sampling Event].
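One way to act on the Preferred Sampling Event column is to pull the event names out of a saved summary and build the matching hotspot commands. The Python sketch below is an assumption-laden illustration: hotspot_commands and the regular expression are hypothetical, the row layout is inferred from the sample output, and the event name is assumed to be passed to -e as printed.

```python
import re

# A metric row: optional tree-drawing prefix, metric name, a decimal
# percentage, and the event column ("--" means no preferred event).
ROW = re.compile(
    r"^[\s│├└─]*(?P<name>.+?)\s+(?P<bound>\d+\.\d+)\s+(?P<event>\S+)\s*$"
)


def hotspot_commands(report_text: str) -> list[str]:
    """Build one hotspot command per distinct preferred sampling event."""
    events = []
    for line in report_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header, separator, or summary line
        event = match.group("event")
        if event != "--" and event not in events:
            events.append(event)
    return [f"devkit tuner hotspot -e {event}" for event in events]


# A few rows copied from the sample report above.
sample = """\
Bad Speculation                       55.99       --
├── Branch Mispredicts                55.86       br_mis_pred
Frontend Bound                        12.48       fetch_bubble
│   │   └── L2 Tlb                    0.00        l2i_tlb_refill
│   ├── Branch Mispredict Flush       8.91        br_mis_pred
"""

for command in hotspot_commands(sample):
    print(command)
```

Duplicate events such as br_mis_pred are listed once, in the order in which they first appear in the report.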
- Collection based on applications:
    devkit tuner top-down -d 10 -o /home/topdown_app -L 2 --package /opt/testdemo/topdown_suggest
In this command, -d 10 sets the collection duration to 10 seconds, and /opt/testdemo/topdown_suggest is the application to be collected. -o /home/topdown_app and --package generate a report data package named topdown_app in the specified path. -L 2 collects the Back-End Bound->Core Bound instruction data.
Command output:
    TOP-DOWN Summary Report-ALL                    Time:2024/12/25 17:45:45
    =======================================================================
    Top-down metrics of /opt/testdemo/topdown_suggest:
    Cycles          25,998,441,500
    Instructions    27,700,695,734
    IPC             1.07
    ──────────────────────────────────────────────────────────────────
    Top-down Metrics                  Bound(%)    Preferred Sampling Event
    ──────────────────────────────────────────────────────────────────
    Bad Speculation                   0.01        --
    Frontend Bound                    25.94       fetch_bubble
    Retiring                          26.63       inst_retired
    Backend Bound                     47.42       --
    ├── Resource Bound                7.74        --
    ├── Core Bound                    31.47       --
    │   ├── Divider Stall             0.00        --
    │   ├── FSU Stall                 0.00        --
    │   └── Exe Ports Util            31.46       --
    │       ├── ALU BRU IssueQ Full   15.67       --
    │       ├── LS IssueQ Full        3.11        --
    │       └── FSU IssueQ Full       0.00        --
    └── Memory Bound                  8.20        --
    ──────────────────────────────────────────────────────────────────
    10000 milliseconds time elapsed
    Note: To view the hotspot data. You can run devkit tuner hotspot -e [Preferred Sampling Event]

    Optimization Suggestions
    1. The percentage of Frontend Bound is high.(Threshold: 20.00%)
    Take the following optimization measures for C/C++ applications compiled using the BiSheng compiler. For other compilers, you can refer to the optimization suggestions. Verify the optimization suggestions in your specific application scenario.
    (1) Set the Inline parameter: -mllvm -inline-threshold=1550 (1550 is an optimal empirical value). You are advised to enable LTO in advance.
    (2) Adjust the alignment of functions, that of basic blocks, and that of basic blocks without jumping: -mllvm -align-all-functions=2^n -mllvm -align-all-blocks=2^n -mllvm -align-all-nofallthru-blocks=2^n, where the value of 2^n is 32 or 64 and can be changed to 16 or 128 if needed.
    (3) Enable PGO: -mllvm -enable-split-machine-functions. After PGO is enabled, the compiler splits functions based on the popularity of basic blocks and adjusts the code block layout to optimize the program performance.
    (4) Enable LTO. There are two types of LTO: full and thin, which correspond to -flto=full and -flto=thin. Full LTO delivers superior performance but requires a longer compilation time. Thin LTO delivers inferior performance but needs a shorter compilation time. To enable LTO for the compilation, add a link time option to the optimization options of 1) to 3). For example, for -mllvm -enable-split-machine-functions, prefix it with -fuse-ld=lld -Wl, that is, -fuse-ld=lld -Wl,-mllvm,-enable-split-machine-functions.
    The DevKit provides the autoFDO capability to automatically adjust compilation based on the feedback result.
    (1) You can use autofdo for the tuning: devkit advisor kfdo -h
    2. The percentage of Backend Bound is high.(Threshold: 20.00%)
    Take the following optimization measures for C/C++ applications compiled using the BiSheng compiler. For other compilers, you can refer to the optimization suggestions. Verify the optimization suggestions in your specific application scenario.
    (1) Use the jemalloc library. Associate the libjemalloc.so soft link in the lib directory of the BiSheng compiler with the jemalloc dynamic library entity whose size is the same as the system page table size in this directory, and add the -ljemalloc parameter for the compilation.
    (2) Set Wrap-memset/memcpy: Wl,-wrap=memset/memcpy -lstringlib. The BiSheng compiler provides memset/memcpy implementation in the libstring library, which is more adaptable to the AArch64 architecture. When the glibc version is earlier and the function proportion is high, the performance is significantly improved.
    (3) Set prefetch to save the data to be accessed to the cache, so as to reduce the value of d-cache miss. The hardware has its own prefetch mechanism. The compiler supports the software prefetch function. When tsv110 is enabled, the BiSheng compiler automatically enables software prefetch. You can adjust the prefetch density by using the three parameters: -mllvm -prefetch-loop-depth=x -mllvm -min-prefetch-stride=y -mllvm -prefetch-distance=z, where for example, x=3, y=9, z=940.
    (4) Add the -fstack-arrays parameter to place all arrays onto the stack. The parameter takes effect only on Fortran.
    (5) Try enabling huge pages.

    The report /home/topdown_app.tar is generated successfully.
    To view summary report. you can run: devkit report -i /home/topdown_app.tar
    To view detail report. you can import the report to the WebUI or IDE to view details.
When the Frontend Bound metric is higher than 20%, the Kunpeng DevKit provides the AutoFDO capability to automatically adjust the compilation based on the feedback result. You can run devkit advisor kfdo -h to view the details.
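The threshold behaviour seen in the sample report can be sketched in Python as follows. The 20.00% threshold and the two checked categories come from the sample output. The helper name flagged is hypothetical, and treating the comparison as strictly greater than is an assumption based on the wording "higher than 20%".

```python
THRESHOLD = 20.00  # percent, as printed in the sample report

# Categories for which the sample report prints optimization suggestions.
CHECKED = ("Frontend Bound", "Backend Bound")


def flagged(metrics: dict[str, float]) -> list[str]:
    """Return the checked categories whose share exceeds the threshold."""
    return [name for name in CHECKED if metrics.get(name, 0.0) > THRESHOLD]


# Top-level values copied from the application-based sample.
app_sample = {
    "Bad Speculation": 0.01,
    "Frontend Bound": 25.94,
    "Retiring": 26.63,
    "Backend Bound": 47.42,
}

print(flagged(app_sample))
```

For the application-based sample, both categories exceed the threshold, which is consistent with the two suggestion groups in the report.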
The command output is an overview of the microarchitecture analysis task. You can use the --package parameter to generate a TAR package and import it to the WebUI to view the data visually. For details, see the description of importing tasks in Task Management.