Memory Access Statistics Analysis
The memory access unit is among the most complex control logic in a CPU. It handles the many issues that arise while executing memory access instructions such as Load and Store, and ensures they execute at high speed. Memory access statistics analysis helps you find the processes that may cause performance problems.
Command Function
Collects PMU events for the caches and memory, and analyzes access counts, hit rates, and bandwidth.
Syntax
```
devkit tuner memory [-h] [-d <sec>] [-l {0, 1, 2, 3}] [-i <sec>] [-m {1, 2, 3, 4}] [-P {100, 1000}] [-c {n,m | n-m}]
```
Parameter Description
| Parameter | Option | Description |
|---|---|---|
| -h/--help | - | Obtains help information. This parameter is optional. |
| -d/--duration | - | Collection duration, in seconds. The minimum value is 1. By default, collection never ends; press Ctrl+\ to cancel the task, or press Ctrl+C to stop the collection and start analysis. This parameter is optional. |
| -l/--log-level | 0/1/2/3 | Log level. The default value is 1. This parameter is optional. |
| -i/--interval | - | Collection interval, in seconds, that is, the time taken to collect data for each subreport. The minimum value is 1, and the maximum value cannot exceed the collection duration. The default value is the collection duration; if this parameter is not set, no subreports are generated. This parameter is optional. |
| -m/--metric | 1/2/3/4 | Sampling type. The default value is 1. This parameter is optional. |
| -P/--period | 100/1000 | Actual data collection period, in milliseconds. The options are 1000 and 100, and the default is 1000. When the collection duration is set to 1 second, the default automatically changes to 100. This parameter is optional. |
| -c/--cpu | - | CPU cores to be collected. The value can be a single core (0), a comma-separated list (0,1,2), or a range (0-2). By default, all CPU cores are collected. This parameter is optional. |
Example
```
devkit tuner memory -d 2 -m 1
```
In this command, -d 2 sets the collection duration to 2 seconds, and -m 1 collects all cache access, DDR access, and HBM bandwidth information. (HBM bandwidth information is displayed only when the environment supports HBM information collection.)
Command output:
```
Memory Summary Report-ALL                      Time:2025/12/09 09:57:23
================================================================================
System Information
───────────────────────────────────────────────────────────────
Linux Kernel Version    4.19.25-203.el7.bclinux.aarch64
Cpu Type                Kunpeng 920
NUMA NODE(cpus)         0(0-31) 1(32-63) 2(64-95) 3(96-127)

Percentage of core Cache miss
───────────────────────────────────────────────────────────────
L1D     3.47%
L1I     0.01%
L2D    58.88%
L2I    35.26%

DDR Bandwidth (system wide)
───────────────────────────────────────────────────────────────
ddrc_write      658.03MB/s
ddrc_read     16900.26MB/s

Memory metrics of the Cache
───────────────────────────────────────────────────────────────
1. L1/L2/TLB Access Bandwidth and Hit Rate
   Value Format: X|Y = Bandwidth | Hit Rate
───────────────────────────────────────────────────────────────
CPU  L1D                  L1I                   L2D                  L2I               L2D_TLB     L2I_TLB
───────────────────────────────────────────────────────────────
all  81581.38MB/s|96.53%  201888.73MB/s|99.99%  35588.65MB/s|41.12%  72.89MB/s|64.74%  N/A|57.10%  N/A|94.37%
───────────────────────────────────────────────────────────────
2. L3 Read Bandwidth and Hit Rate
───────────────────────────────────────────────────────────────
NODE  CCL  Read Hit Bandwidth  Read Bandwidth  Read Hit Rate
───────────────────────────────────────────────────────────────
0          369.95MB/s          21079.22MB/s    1.76%
1          10.93MB/s           181.49MB/s      6.02%
2          23.75MB/s           296.73MB/s      8.00%
3          4.17MB/s            110.28MB/s      3.78%
──────────────────────────────────────────────────────────────────
Memory metrics of the DDRC
──────────────────────────────────────────────────────────────────
1. DDRC_ACCESS_BANDWIDTH
   Value Format: X|Y = DDR read | DDR write
   DDRC Read Bandwidth Bottleneck: 12500MB/s (for reference only)
   Exceeding the bottleneck will significantly increase latency.
   Please refer to README_ZH.md(Chapter 6.7) for specific bottleneck testing configurations.
   DDRC exceeding bottleneck: [Node 0, DDRC_2, DDR READ]
──────────────────────────────────────────────────────────────────
NODE  DDRC_0             DDRC_1             DDRC_2                   DDRC_3              Total
──────────────────────────────────────────────────────────────────
0     0.00MB/s|0.00MB/s  0.00MB/s|0.00MB/s  16779.55MB/s|616.21MB/s  0.00MB/s|0.00MB/s   16779.55MB/s|616.21MB/s
1     0.00MB/s|0.00MB/s  0.00MB/s|0.00MB/s  0.00MB/s|0.00MB/s        7.55MB/s|5.24MB/s   7.55MB/s|5.24MB/s
2     0.00MB/s|0.00MB/s  0.00MB/s|0.00MB/s  85.35MB/s|27.01MB/s      0.00MB/s|0.00MB/s   85.35MB/s|27.01MB/s
3     0.00MB/s|0.00MB/s  0.00MB/s|0.00MB/s  0.00MB/s|0.00MB/s        27.80MB/s|9.56MB/s  27.80MB/s|9.56MB/s
──────────────────────────────────────────────────────────────────
```
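The rates in the report follow directly from the bandwidth counters: an L3 read hit rate is the read hit bandwidth divided by the read bandwidth, and a cache miss rate is the complement of the corresponding hit rate. A quick check against two figures from the sample report above:

```python
# Reproduce two figures from the sample report.

# L3 read hit rate on NUMA node 0: read hit bandwidth / read bandwidth.
read_hit_bw = 369.95   # MB/s, from the L3 table
read_bw = 21079.22     # MB/s, from the L3 table
l3_hit_rate = read_hit_bw / read_bw * 100
print(f"{l3_hit_rate:.2f}%")  # 1.76%, matching the report

# L1D miss rate is the complement of the L1D hit rate.
l1d_hit_rate = 96.53   # %, from the cache metrics table
print(f"{100 - l1d_hit_rate:.2f}%")  # 3.47%, matching the summary
```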
Output report description:
The report consists of six parts. From top to bottom, they are: the system information, the average miss rates of the L1 and L2 caches, the total double data rate (DDR) bandwidth, the bandwidths and hit rates of the L1 and L2 caches, the read bandwidth and hit rate of the L3 cache, and the DDR controller (DDRC) bandwidth.
- System information
Displays the Linux kernel version, CPU type, NUMA nodes, and CPU cores row by row.
- Average miss rates of L1 and L2 caches
Displays the average L1D, L1I, L2D, and L2I cache miss rates of the CPU cores, that is, the ratio of cache misses to total accesses.
- Total DDR bandwidth
Displays the system-wide DDR read (ddrc_read) and write (ddrc_write) bandwidths.
- L1 and L2 cache bandwidths and hit rates
If you use the -c parameter to specify the CPU cores to be collected, the L1 and L2 cache bandwidths and hit rates of each specified core are displayed. If you do not specify CPU cores, the averages across all CPU cores are displayed by default.
- L3 cache read bandwidth and hit rate
Displays the read hit bandwidth, read bandwidth, and hit rate of the L3 cache on each NUMA node row by row.
- DDRC bandwidth information
Displays the read and write bandwidths of each DDRC. Generally, a NUMA node has four DDRCs.
- Memory access analysis provides a reference DDR read bandwidth bottleneck for Kunpeng 920 servers, so you can determine whether the current DDRC read bandwidth has reached the bottleneck. Once the bottleneck is reached, the latency between the CPU and the DDRC increases significantly. The bottleneck of each server is measured under its standard configuration; the reference DDRC rate of the Kunpeng 920 server is 2933 MT/s.
- In the result, DDRC Read Bandwidth Bottleneck is the reference bottleneck value. If any value in the DDRC_ACCESS_BANDWIDTH table exceeds it, a DDRC exceeding bottleneck row is added, which locates the offending data in Node_n+DDRC_n+read/write type form.
- When the one numa per socket option is enabled in the BIOS, each NUMA node has eight DDRC channels. Each DDRC bandwidth displayed in the result report is the combined bandwidth of two DDRCs, which belong to the two combined CPU dies (each CPU socket has two CPU dies).
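The DDRC exceeding bottleneck check described above can be sketched as a simple post-processing step. This is a minimal illustration with an assumed input layout (per-DDRC read/write bandwidths in MB/s keyed by node and DDRC index); the real tool derives these values from PMU counters:

```python
READ_BOTTLENECK_MB_S = 12500  # reference bottleneck for Kunpeng 920 (2933 MT/s)

def find_exceeding(ddrc_bandwidth):
    """Return 'Node n, DDRC_m, DDR READ' strings for every entry whose
    read bandwidth exceeds the reference bottleneck.

    ddrc_bandwidth: {node: {ddrc_index: (read_mb_s, write_mb_s)}}
    """
    exceeded = []
    for node, ddrcs in sorted(ddrc_bandwidth.items()):
        for idx, (read_bw, _write_bw) in sorted(ddrcs.items()):
            if read_bw > READ_BOTTLENECK_MB_S:
                exceeded.append(f"Node {node}, DDRC_{idx}, DDR READ")
    return exceeded

# Node 0's DDRC_2 from the sample report exceeds the 12500 MB/s bottleneck:
sample = {0: {2: (16779.55, 616.21), 3: (0.0, 0.0)},
          2: {2: (85.35, 27.01)}}
print(find_exceeding(sample))  # ['Node 0, DDRC_2, DDR READ']
```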