Microarchitecture Analysis
Microarchitecture analysis uses Arm performance monitoring unit (PMU) events to show how instructions progress through the CPU pipeline. You can then modify your application accordingly to make full use of your hardware resources.
Command Function
Analyzes the running status of instructions on the CPU pipeline based on Arm PMU events, helping quickly locate performance bottlenecks of the current application on the CPUs.
Syntax
    devkit tuner top-down [-h] [-c {n | n,m | n-m}] [-d <sec>] [-D <sec>] [-l {0, 1, 2, 3}] [-L {0, 1, 2, 3, 4, 5, 6}] [-i <sec>] [-p {PID | PID1,PID2 | ALL}] [-r {user, kernel, all}] [-o] [--package] [workload workload...]
Use devkit tuner top-down [workload workload...] to collect data of a specified application: replace [workload workload...] with the application path and the application parameters. If both -c/--cpu and -p/--pid are specified, -p takes precedence and only the processes with the specified PIDs are collected.
Parameter Description
| Parameter | Option | Description |
|---|---|---|
| -h/--help | - | Obtains help information. This parameter is optional. |
| -c/--cpu | - | Numbers of CPU cores to be collected, for example, 0, 0,1,2, and 0-2. This parameter is optional. |
| -d/--duration | - | Collection duration, in seconds. The minimum value is 1 second. By default, collection never ends. You can press Ctrl+\ to cancel the task or press Ctrl+C to stop the collection and start analysis. This parameter is optional. |
| -D/--delay | - | Collection delay, in seconds, which defaults to 0 and must be less than the collection duration. This parameter is optional. |
| -l/--log-level | 0/1/2/3 | Log level, which defaults to 1. This parameter is optional. |
| -L/--profile-level | 0/1/2/3/4/5/6 | Analysis metric, which defaults to 0. This parameter is optional. |
| -i/--interval | - | Collection interval, in seconds, which defaults to 1. If the collection duration is set, the collection interval must be less than or equal to the configured collection duration. This parameter is optional. |
| -p/--pid | - | ID of a process to be collected. Separate multiple PIDs with commas (,). The default value is ALL. This parameter is optional. If both the -p and -c parameters are used, only the processes with the specified PIDs are collected. |
| -r/--collection-range | user/kernel/all | Process collection level. When -p/--pid is set to ALL, the option user or kernel can be selected, which means that user-mode processes or kernel-mode processes can be collected. This parameter is optional. The default value is all, which collects user-mode and kernel-mode performance data. |
| -o/--output | - | Report package name and output path (no package name extension required). If you enter a name only, the report package is generated in the current directory by default. This option must be used together with --package. This parameter is optional. |
| --package | - | Indicates whether to generate a report data package. If you do not set the package name or path, the top-down-Timestamp.tar package is generated in the current directory by default. This parameter is optional. |
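The constraints in the table above (minimum duration, delay shorter than duration, interval no longer than duration, -o only together with --package) can be checked before a command is launched. The following is a minimal Python sketch under stated assumptions: build_top_down_args is a hypothetical helper, not part of DevKit, and it only uses flags listed in the table.

```python
from typing import List, Optional


def build_top_down_args(
    duration: Optional[int] = None,
    delay: int = 0,
    interval: int = 1,
    profile_level: int = 0,
    output: Optional[str] = None,
    package: bool = False,
) -> List[str]:
    """Hypothetical helper: check the documented constraints, then build argv."""
    if duration is not None:
        if duration < 1:
            raise ValueError("-d/--duration must be at least 1 second")
        if delay >= duration:
            raise ValueError("-D/--delay must be less than -d/--duration")
        if interval > duration:
            raise ValueError("-i/--interval must not exceed -d/--duration")
    if profile_level not in range(0, 7):
        raise ValueError("-L/--profile-level must be one of 0 to 6")
    if output is not None and not package:
        raise ValueError("-o/--output must be used together with --package")

    # Only emit flags that differ from the documented defaults.
    args = ["devkit", "tuner", "top-down"]
    if duration is not None:
        args += ["-d", str(duration)]
    if delay:
        args += ["-D", str(delay)]
    if interval != 1:
        args += ["-i", str(interval)]
    if profile_level:
        args += ["-L", str(profile_level)]
    if output is not None:
        args += ["-o", output]
    if package:
        args.append("--package")
    return args


if __name__ == "__main__":
    print(" ".join(build_top_down_args(
        duration=3, profile_level=2, output="/home/topdown_cpu", package=True)))
```

The returned list can be passed to subprocess.run on a machine where DevKit is installed. Whether the tool itself rejects invalid combinations in the same way is not stated here, so treat the checks as a convenience only.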
Example
- Collect CPU data.
    devkit tuner top-down -c 0-127 -d 3 -o /home/topdown_cpu -L 2 --package
This command collects data from CPU cores 0 to 127 (-c 0-127) for 3 seconds (-d 3). The -o /home/topdown_cpu and --package parameters generate a report data package named topdown_cpu in the /home directory. The -L 2 parameter collects the Backend Bound->Core Bound instruction data.
Command output:
    TOP-DOWN Summary Report-ALL                       Time:2024/08/06 15:41:27
    =======================================================================

    Top-down metrics of the system:

    Cycles                      244,796,223
    Instructions                138,949,659
    IPC                         0.57
    ──────────────────────────────────────────────────────────────────
    Top-down Metrics                    Bound(%)  Preferred Sampling Event
    ──────────────────────────────────────────────────────────────────
    Bad Speculation                      15.31    --
    Frontend Bound                       40.35    fetch_bubble
    Retiring                             14.19    inst_retired
    Backend Bound                        30.15    --
    ├── Resource Bound                    4.43    --
    ├── Core Bound                       15.42    --
    │   ├── Divider Stall                 0.00    --
    │   ├── FSU Stall                     0.00    --
    │   └── Exe Ports Util               15.41    --
    │       ├── ALU BRU IssueQ Full       0.61    --
    │       ├── LS IssueQ Full            1.14    --
    │       └── FSU IssueQ Full           0.00    --
    └── Memory Bound                     10.26    --
    ──────────────────────────────────────────────────────────────────

    3009 milliseconds time elapsed

    Note: To view the hotspot data. You can run
    devkit tuner hotspot -e [Preferred Sampling Event]

    The report /home/topdown_cpu.tar is generated successfully.
    To view summary report. you can run: devkit report -i /home/topdown_cpu.tar
    To view detail report. you can import the report to the WebUI or IDE to view details.
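As a quick sanity check of this summary, IPC is the instruction count divided by the cycle count, and the four top-level categories should add up to roughly 100%. The Python sketch below redoes that arithmetic with values copied from the sample output. The variable names are illustrative, and the tolerance for the second-level sum is an assumption made to absorb rounding in the report.

```python
# Values copied from the sample system-wide report.
cycles = 244_796_223
instructions = 138_949_659

# IPC (instructions per cycle) = instructions / cycles.
ipc = instructions / cycles

top_level = {
    "Bad Speculation": 15.31,
    "Frontend Bound": 40.35,
    "Retiring": 14.19,
    "Backend Bound": 30.15,
}
backend_children = {
    "Resource Bound": 4.43,
    "Core Bound": 15.42,
    "Memory Bound": 10.26,
}

top_level_total = sum(top_level.values())
backend_total = sum(backend_children.values())

# The largest top-level share is the first place to look.
dominant = max(top_level, key=top_level.get)

if __name__ == "__main__":
    print(f"IPC: {ipc:.2f}")
    print(f"Top-level total: {top_level_total:.2f}%")
    print(f"Backend children total: {backend_total:.2f}% "
          f"(reported Backend Bound: {top_level['Backend Bound']:.2f}%)")
    print(f"Largest top-level category: {dominant}")
```

In this sample the child categories of Backend Bound sum to slightly less than the parent value, so the breakdown is best read as approximate rather than exact.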
- Collect data of a specified process.
    devkit tuner top-down -p 12540 -d 3 -o /home/topdown_pid --package
In this command, -p 12540 collects data of the process whose ID is 12540, with a collection duration of 3 seconds (-d 3). The -o /home/topdown_pid and --package parameters generate a report data package named topdown_pid in the /home directory. Because -L is not specified, data of all dimensions is collected.
Command output:
    TOP-DOWN Summary Report-ALL                       Time:2024/08/06 15:48:48
    =======================================================================

    Top-down metrics of process id '1884856':

    Cycles                      1,488,556,148
    Instructions                1,480,811,195
    IPC                         0.99
    ──────────────────────────────────────────────────────────────────
    Top-down Metrics                    Bound(%)  Preferred Sampling Event
    ──────────────────────────────────────────────────────────────────
    Bad Speculation                      55.99    --
    ├── Branch Mispredicts               55.86    br_mis_pred
    │   ├── Indirect Branch               0.00    --
    │   ├── Push Branch                   0.00    --
    │   ├── Pop Branch                    0.00    --
    │   └── Other Branch                 55.86    --
    └── Machine Clears                    0.12    --
        ├── Nuke Flush                    0.02    --
        └── Other Flush                   0.09    --
    Frontend Bound                       12.48    fetch_bubble
    ├── Fetch Latency Bound               9.42    --
    │   ├── ITLB Miss                     0.02    --
    │   │   ├── L1 Tlb                    0.02    --
    │   │   └── L2 Tlb                    0.00    l2i_tlb_refill
    │   ├── ICache Miss                   0.43    --
    │   │   ├── L1 Cache                  0.11    --
    │   │   └── L2 Cache                  0.32    l2i_cache_refill
    │   ├── Branch Mispredict Flush       8.91    br_mis_pred
    │   ├── OoO Flush                     0.01    --
    │   └── Static Predictor Flush        0.05    --
    └── Fetch Bandwidth Bound             3.05    --
    Retiring                             24.86    inst_retired
    Backend Bound                         6.66    --
    ├── Resource Bound                    0.07    --
    │   ├── Sync Stall                    0.00    --
    │   ├── Reorder Buffer Stall          0.00    --
    │   ├── Physical Tag Stall            0.07    --
    │   ├── SaveOp Queue Stall            0.00    --
    │   ├── PC Buffer Stall               0.00    --
    │   └── Other Stall                   0.00    --
    ├── Core Bound                        4.80    --
    │   ├── Divider Stall                 0.00    --
    │   ├── FSU Stall                     0.00    --
    │   └── Exe Ports Util                4.79    --
    │       ├── ALU BRU IssueQ Full       0.03    --
    │       ├── LS IssueQ Full            0.17    --
    │       └── FSU IssueQ Full           0.00    --
    └── Memory Bound                      1.77    --
        ├── L1 Bound                      1.73    --
        ├── L2 Bound                      0.03    --
        ├── L3 or DRAM Bound              0.01    cache-misses
        └── Store Bound                   0.00    --
    ──────────────────────────────────────────────────────────────────

    3000 milliseconds time elapsed

    Note: To view the hotspot data. You can run
    devkit tuner hotspot -e [Preferred Sampling Event]

    The report /home/topdown_pid.tar is generated successfully.
    To view summary report. you can run: devkit report -i /home/topdown_pid.tar
    To view detail report. you can import the report to the WebUI or IDE to view details.
The Preferred Sampling Event column lists the key events associated with each microarchitecture bound metric. Tuning your application with respect to these key events can reduce the corresponding bound. Run devkit tuner hotspot -e [Preferred Sampling Event] to analyze and tune a specific event.
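To illustrate how the Preferred Sampling Event column can feed the hotspot command, the Python sketch below parses a few metric rows copied from the process report and prints one devkit tuner hotspot -e command per distinct event, largest bound first. The function names, the row regular expression, and the 5% cutoff are assumptions for illustration. DevKit does not provide this parser.

```python
import re
from typing import List, Tuple

# A few rows copied from the sample process report.
SAMPLE = """\
Bad Speculation                      55.99    --
├── Branch Mispredicts               55.86    br_mis_pred
Frontend Bound                       12.48    fetch_bubble
│   │   └── L2 Tlb                    0.00    l2i_tlb_refill
│   ├── Branch Mispredict Flush       8.91    br_mis_pred
Retiring                             24.86    inst_retired
"""

# Metric name, percentage with a decimal point, then the event column.
ROW = re.compile(r"^(?P<name>.+?)\s+(?P<bound>\d+\.\d+)\s+(?P<event>\S+)$")


def parse_rows(report: str) -> List[Tuple[str, float, str]]:
    """Return (metric, bound %, event) for every metric row in the text."""
    rows = []
    for line in report.splitlines():
        text = line.lstrip(" │├└─").rstrip()  # drop tree-drawing characters
        match = ROW.match(text)
        if match:
            rows.append((match["name"], float(match["bound"]), match["event"]))
    return rows


def hotspot_commands(rows: List[Tuple[str, float, str]],
                     min_bound: float = 5.0) -> List[str]:
    """Build hotspot commands for events whose bound reaches min_bound.

    min_bound is an arbitrary illustrative cutoff, not a DevKit setting.
    """
    commands = []
    seen = set()
    for _name, bound, event in sorted(rows, key=lambda r: r[1], reverse=True):
        if event == "--" or bound < min_bound or event in seen:
            continue
        seen.add(event)
        commands.append(f"devkit tuner hotspot -e {event}")
    return commands


if __name__ == "__main__":
    for command in hotspot_commands(parse_rows(SAMPLE)):
        print(command)
```

Note that Retiring represents useful work, so its inst_retired event appearing in the list does not by itself indicate a bottleneck.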
- Collect application data.
    devkit tuner top-down -d 10 -o /home/topdown_app -L 2 --package /opt/testdemo/topdown_suggest
This command collects data of the application /opt/testdemo/topdown_suggest for 10 seconds (-d 10). The -o /home/topdown_app and --package parameters generate a report data package named topdown_app in the /home directory. The -L 2 parameter collects the Backend Bound->Core Bound instruction data.
Command output:
    TOP-DOWN Summary Report-ALL                       Time:2024/12/25 17:45:45
    =======================================================================

    Top-down metrics of /opt/testdemo/topdown_suggest:

    Cycles                      25,998,441,500
    Instructions                27,700,695,734
    IPC                         1.07
    ──────────────────────────────────────────────────────────────────
    Top-down Metrics                    Bound(%)  Preferred Sampling Event
    ──────────────────────────────────────────────────────────────────
    Bad Speculation                       0.01    --
    Frontend Bound                       25.94    fetch_bubble
    Retiring                             26.63    inst_retired
    Backend Bound                        47.42    --
    ├── Resource Bound                    7.74    --
    ├── Core Bound                       31.47    --
    │   ├── Divider Stall                 0.00    --
    │   ├── FSU Stall                     0.00    --
    │   └── Exe Ports Util               31.46    --
    │       ├── ALU BRU IssueQ Full      15.67    --
    │       ├── LS IssueQ Full            3.11    --
    │       └── FSU IssueQ Full           0.00    --
    └── Memory Bound                      8.20    --
    ──────────────────────────────────────────────────────────────────

    10000 milliseconds time elapsed

    Note: To view the hotspot data. You can run
    devkit tuner hotspot -e [Preferred Sampling Event]

    Optimization Suggestions
    1. The percentage of Frontend Bound is high.(Threshold: 20.00%)
       Take the following optimization measures for C/C++ applications compiled using the BiSheng
       compiler. For other compilers, you can refer to the optimization suggestions. Verify the
       optimization suggestions in your specific application scenario.
       (1) Set the Inline parameter: -mllvm -inline-threshold=1550 (1550 is an optimal empirical
           value). You are advised to enable LTO in advance.
       (2) Adjust the alignment of functions, that of basic blocks, and that of basic blocks
           without jumping: -mllvm -align-all-functions=2^n -mllvm -align-all-blocks=2^n
           -mllvm -align-all-nofallthru-blocks=2^n, where the value of 2^n is 32 or 64 and can be
           changed to 16 or 128 if needed.
       (3) Enable PGO: -mllvm -enable-split-machine-functions. After PGO is enabled, the compiler
           splits functions based on the popularity of basic blocks and adjusts the code block
           layout to optimize the program performance.
       (4) Enable LTO. There are two types of LTO: full and thin, which correspond to -flto=full
           and -flto=thin. Full LTO delivers superior performance but requires a longer
           compilation time. Thin LTO delivers inferior performance but needs a shorter
           compilation time. To enable LTO for the compilation, add a link time option to the
           optimization options of 1) to 3). For example, for -mllvm
           -enable-split-machine-functions, prefix it with -fuse-ld=lld -Wl, that is,
           -fuse-ld=lld -Wl,-mllvm,-enable-split-machine-functions.
    2. The percentage of Backend Bound is high.(Threshold: 20.00%)
       Take the following optimization measures for C/C++ applications compiled using the BiSheng
       compiler. For other compilers, you can refer to the optimization suggestions. Verify the
       optimization suggestions in your specific application scenario.
       (1) Use the jemalloc library. Associate the libjemalloc.so soft link in the lib directory
           of the BiSheng compiler with the jemalloc dynamic library entity whose size is the
           same as the system page table size in this directory, and add the -ljemalloc parameter
           for the compilation.
       (2) Set Wrap-memset/memcpy: Wl,-wrap=memset/memcpy -lstringlib. The BiSheng compiler
           provides memset/memcpy implementation in the libstring library, which is more
           adaptable to the AArch64 architecture. When the glibc version is earlier and the
           function proportion is high, the performance is significantly improved.
       (3) Set prefetch to save the data to be accessed to the cache, so as to reduce the value
           of d-cache miss. The hardware has its own prefetch mechanism. The compiler supports
           the software prefetch function. When tsv110 is enabled, the BiSheng compiler
           automatically enables software prefetch. You can adjust the prefetch density by using
           the three parameters: -mllvm -prefetch-loop-depth=x -mllvm -min-prefetch-stride=y
           -mllvm -prefetch-distance=z, where for example, x=3, y=9, z=940.
       (4) Add the -fstack-arrays parameter to place all arrays onto the stack. The parameter
           takes effect only on Fortran.
       (5) Try enabling huge pages.

    The report /home/topdown_app.tar is generated successfully.
    To view summary report. you can run: devkit report -i /home/topdown_app.tar
    To view detail report. you can import the report to the WebUI or IDE to view details.
- View the generated report.
    devkit report -i /home/topdown_app.tar
Command output:
    TOP-DOWN Summary Report-ALL                       Time:2024/12/25 17:45:45
    =======================================================================

    Top-down metrics of /opt/testdemo/topdown_suggest:

    Cycles                      25,998,441,500
    Instructions                27,700,695,734
    IPC                         1.07
    ──────────────────────────────────────────────────────────────────
    Top-down Metrics                    Bound(%)  Preferred Sampling Event
    ──────────────────────────────────────────────────────────────────
    Bad Speculation                       0.01    --
    Frontend Bound                       25.94    fetch_bubble
    Retiring                             26.63    inst_retired
    Backend Bound                        47.42    --
    ├── Resource Bound                    7.74    --
    ├── Core Bound                       31.47    --
    │   ├── Divider Stall                 0.00    --
    │   ├── FSU Stall                     0.00    --
    │   └── Exe Ports Util               31.46    --
    │       ├── ALU BRU IssueQ Full      15.67    --
    │       ├── LS IssueQ Full            3.11    --
    │       └── FSU IssueQ Full           0.00    --
    └── Memory Bound                      8.20    --
    ──────────────────────────────────────────────────────────────────

    10000 milliseconds time elapsed

    Note: To view the hotspot data. You can run
    devkit tuner hotspot -e [Preferred Sampling Event]

    Optimization Suggestions
    1. The percentage of Frontend Bound is high.(Threshold: 20.00%)
       Take the following optimization measures for C/C++ applications compiled using the BiSheng
       compiler. For other compilers, you can refer to the optimization suggestions. Verify the
       optimization suggestions in your specific application scenario.
       (1) Set the Inline parameter: -mllvm -inline-threshold=1550 (1550 is an optimal empirical
           value). You are advised to enable LTO in advance.
       (2) Adjust the alignment of functions, that of basic blocks, and that of basic blocks
           without jumping: -mllvm -align-all-functions=2^n -mllvm -align-all-blocks=2^n
           -mllvm -align-all-nofallthru-blocks=2^n, where the value of 2^n is 32 or 64 and can be
           changed to 16 or 128 if needed.
       (3) Enable PGO: -mllvm -enable-split-machine-functions. After PGO is enabled, the compiler
           splits functions based on the popularity of basic blocks and adjusts the code block
           layout to optimize the program performance.
       (4) Enable LTO. There are two types of LTO: full and thin, which correspond to -flto=full
           and -flto=thin. Full LTO delivers superior performance but requires a longer
           compilation time. Thin LTO delivers inferior performance but needs a shorter
           compilation time. To enable LTO for the compilation, add a link time option to the
           optimization options of 1) to 3). For example, for -mllvm
           -enable-split-machine-functions, prefix it with -fuse-ld=lld -Wl, that is,
           -fuse-ld=lld -Wl,-mllvm,-enable-split-machine-functions.
    2. The percentage of Backend Bound is high.(Threshold: 20.00%)
       Take the following optimization measures for C/C++ applications compiled using the BiSheng
       compiler. For other compilers, you can refer to the optimization suggestions. Verify the
       optimization suggestions in your specific application scenario.
       (1) Use the jemalloc library. Associate the libjemalloc.so soft link in the lib directory
           of the BiSheng compiler with the jemalloc dynamic library entity whose size is the
           same as the system page table size in this directory, and add the -ljemalloc parameter
           for the compilation.
       (2) Set Wrap-memset/memcpy: Wl,-wrap=memset/memcpy -lstringlib. The BiSheng compiler
           provides memset/memcpy implementation in the libstring library, which is more
           adaptable to the AArch64 architecture. When the glibc version is earlier and the
           function proportion is high, the performance is significantly improved.
       (3) Set prefetch to save the data to be accessed to the cache, so as to reduce the value
           of d-cache miss. The hardware has its own prefetch mechanism. The compiler supports
           the software prefetch function. When tsv110 is enabled, the BiSheng compiler
           automatically enables software prefetch. You can adjust the prefetch density by using
           the three parameters: -mllvm -prefetch-loop-depth=x -mllvm -min-prefetch-stride=y
           -mllvm -prefetch-distance=z, where for example, x=3, y=9, z=940.
       (4) Add the -fstack-arrays parameter to place all arrays onto the stack. The parameter
           takes effect only on Fortran.
       (5) Try enabling huge pages.
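The optimization suggestions in the application report are each tied to a 20.00% threshold. As a rough illustration, the Python sketch below compares the Frontend Bound and Backend Bound values from that report against the threshold. The names are illustrative, and the check covers only the two categories flagged in the sample: how DevKit decides when to print suggestions, and whether other categories have thresholds, is not stated here.

```python
# Threshold as printed in the sample report: "(Threshold: 20.00%)".
THRESHOLD = 20.00

# Top-level values copied from the application report.
metrics = {
    "Bad Speculation": 0.01,
    "Frontend Bound": 25.94,
    "Retiring": 26.63,
    "Backend Bound": 47.42,
}

# Assumption: only the two categories that received suggestions in the
# sample are checked. Retiring is useful work, so it is not treated as a
# bottleneck even though its share is above 20%.
CHECKED = ("Frontend Bound", "Backend Bound")

flagged = [name for name in CHECKED if metrics[name] > THRESHOLD]

if __name__ == "__main__":
    for name in flagged:
        print(f"{name} is high: {metrics[name]:.2f}% "
              f"(threshold {THRESHOLD:.2f}%)")
```

Both categories exceed the threshold in this sample, which matches the two suggestion groups shown in the report.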