Memory Access Statistics Analysis
Command Function
Collects the PMU events of the cache and memory and analyzes the number of memory accesses, hit rates, and bandwidth.
Syntax
    devkit tuner memory [-h] [-d <sec>] [-l {0, 1, 2, 3}] [-i <sec>] [-o] [-m {1, 2, 3, 4}] [-P {100, 1000}] [-c {n,m | n-m}] [--package]
Parameter Description
| Parameter | Option | Description |
|---|---|---|
| -h/--help | - | Obtains help information. |
| -d/--duration | - | Collection duration, in seconds. The minimum value is 1 second. By default, collection never ends. Press Ctrl+\ to cancel the task, or press Ctrl+C to stop the collection and start analysis. |
| -l/--log-level | 0/1/2/3 | Log level, which defaults to 1. |
| -i/--interval | - | Collection interval, in seconds, that is, the time span covered by each subreport. The minimum value is 1 second and the maximum value is the collection duration. The default value is the collection duration. If this parameter is not set, no subreports are generated. |
| -m/--metric | 1/2/3/4 | Sampling type, which defaults to 1. |
| -o/--output | - | Report package name and output path. If you enter a name only, the report package is generated in the current directory. This option must be used together with --package. |
| -c/--cpu | - | CPU cores to collect data from. The value can be a single core (0), a comma-separated list (0,1,2), or a range (0-2). By default, all CPU cores are collected. |
| -P/--period | 100/1000 | Data collection interval, in milliseconds. The options are 1000 ms and 100 ms, and the default is 1000 ms. When the collection duration is set to 1 second, the default automatically changes to 100 ms. |
| --package | - | Indicates whether to generate a report data package. If you do not set the package name or path, the memory-timestamp.tar package is generated in the current directory by default. |
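The defaults and limits in the table interact, so a small sketch may help. The helpers below are hypothetical (`parse_cpu_spec` and `resolve_timing` are not part of DevKit); they only restate the documented rules for -c, -d, -i, and -P, and assume the -c value takes one of the three forms described above.

```python
from typing import List, Optional


def parse_cpu_spec(spec: str) -> List[int]:
    """Expand a -c value such as "0", "0,1,2" or "0-2" into core IDs."""
    if "-" in spec:
        start, end = (int(part) for part in spec.split("-"))
        if start > end:
            raise ValueError("range start must not exceed range end")
        return list(range(start, end + 1))
    return [int(part) for part in spec.split(",")]


def resolve_timing(duration: Optional[int] = None,
                   interval: Optional[int] = None,
                   period: Optional[int] = None) -> dict:
    """Apply the documented defaults and limits for -d, -i and -P."""
    if duration is not None and duration < 1:
        raise ValueError("duration must be at least 1 second")
    if interval is None:
        # Default: the interval equals the duration, so no subreports.
        interval = duration
    else:
        if interval < 1:
            raise ValueError("interval must be at least 1 second")
        if duration is not None and interval > duration:
            raise ValueError("interval cannot exceed the duration")
    if period is None:
        # Default is 1000 ms, or 100 ms when the duration is 1 second.
        period = 100 if duration == 1 else 1000
    elif period not in (100, 1000):
        raise ValueError("period must be 100 or 1000 ms")
    return {"duration": duration, "interval": interval, "period_ms": period}


print(parse_cpu_spec("0-2"))
print(resolve_timing(duration=1))
```

A duration of `None` stands for the default of collecting until interrupted.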
Example
    devkit tuner memory -d 2 -o /home/memory_result -m 1 --package
- In this command, -d 2 sets the collection duration to 2 seconds. -o /home/memory_result, together with --package, generates a report data package named memory_result in the specified path. -m 1 collects all cache access data, DDR access data, and HBM bandwidth. (HBM bandwidth is collected only when the environment supports it.)
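When the command is launched from a script, it can be assembled as an argument list. The following is a minimal sketch: `build_memory_command` is a hypothetical helper, and it uses only the options shown in the example above. It also enforces the documented rule that -o must be used together with --package.

```python
import shlex
from typing import List, Optional


def build_memory_command(duration: Optional[int] = None,
                         output: Optional[str] = None,
                         metric: Optional[int] = None,
                         package: bool = False) -> List[str]:
    """Build the argument list for `devkit tuner memory` (hypothetical helper)."""
    argv = ["devkit", "tuner", "memory"]
    if duration is not None:
        argv += ["-d", str(duration)]
    if output is not None:
        if not package:
            raise ValueError("-o must be used together with --package")
        argv += ["-o", output]
    if metric is not None:
        argv += ["-m", str(metric)]
    if package:
        argv.append("--package")
    return argv


argv = build_memory_command(duration=2, output="/home/memory_result",
                            metric=1, package=True)
# Print the command line instead of running it, so the sketch works
# on machines where DevKit is not installed.
print(shlex.join(argv))
```

On a machine where DevKit is installed, the returned list could be passed to `subprocess.run`.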
Command output:
    Memory Summary Report-ALL                              Time:2024/07/22 15:30:16
    ================================================================================

    System Information
    ────────────────────────────────────────────────────────────────────
    Linux Kernel Version   4.19.25-203.el7.bclinux.aarch64
    Cpu Type               Kunpeng 920
    NUMA NODE(cpus)        0(0-31)  1(32-63)  2(64-95)  3(96-127)

    Percentage of core Cache miss
    ────────────────────────────────────────────────────────────────────
    L1D   3.47%
    L1I   0.01%
    L2D   58.88%
    L2I   35.26%

    DDR Bandwidth
    ────────────────────────────────────────────────────────────────────
    ddrc_write   658.03MB/s
    ddrc_read    16900.26MB/s

    Memory metrics of the Cache
    ────────────────────────────────────────────────────────────────────
    1. L1/L2/TLB Access Bandwidth and Hit Rate
    Value Format: X|Y = Bandwidth | Hit Rate
    ─────────────────────────────────────────────────────────────────────
    CPU   L1D                   L1I                    L2D                   L2I                L2D_TLB      L2I_TLB
    ────────────────────────────────────────────────────────────────────
    all   81581.38MB/s|96.53%   201888.73MB/s|99.99%   35588.65MB/s|41.12%   72.89MB/s|64.74%   N/A|57.10%   N/A|94.37%
    ────────────────────────────────────────────────────────────────────
    2. L3 Read Bandwidth and Hit Rate
    ─────────────────────────────────────────────────────────────────
    NODE   Read Hit Bandwidth   Read Bandwidth   Read Hit Rate
    ─────────────────────────────────────────────────────────────────
    0      369.95MB/s           21079.22MB/s     1.76%
    1      10.93MB/s            181.49MB/s       6.02%
    2      23.75MB/s            296.73MB/s       8.00%
    3      4.17MB/s             110.28MB/s       3.78%
    ─────────────────────────────────────────────────────────────────

    Memory metrics of the DDRC
    ────────────────────────────────────────────────────────────────────
    1. DDRC_ACCESS_BANDWIDTH
    Value Format: X|Y = DDR read | DDR write
    DDRC Read Bandwidth Bottleneck: 12500MB/s (for reference only)
    Exceeding the bottleneck will significantly increase latency.
    Please refer to README_ZH.md(Chapter 6.7) for specific bottleneck testing configurations.
    DDRC exceeding bottleneck: [Node 0, DDRC_2, DDR READ]
    ────────────────────────────────────────────────────────────────────
    NODE   DDRC_0              DDRC_1              DDRC_2                    DDRC_3              Total
    ────────────────────────────────────────────────────────────────────
    0      0.00MB/s|0.00MB/s   0.00MB/s|0.00MB/s   16779.55MB/s|616.21MB/s   0.00MB/s|0.00MB/s   16779.55MB/s|616.21MB/s
    1      0.00MB/s|0.00MB/s   0.00MB/s|0.00MB/s   0.00MB/s|0.00MB/s         7.55MB/s|5.24MB/s   7.55MB/s|5.24MB/s
    2      0.00MB/s|0.00MB/s   0.00MB/s|0.00MB/s   85.35MB/s|27.01MB/s       0.00MB/s|0.00MB/s   85.35MB/s|27.01MB/s
    3      0.00MB/s|0.00MB/s   0.00MB/s|0.00MB/s   0.00MB/s|0.00MB/s         27.80MB/s|9.56MB/s  27.80MB/s|9.56MB/s
    ────────────────────────────────────────────────────────────────────

    The report /home/memory_result.tar is generated successfully.
    To view summary report. you can run: devkit report -i /home/memory_result.tar
    To view detail report. you can import the report to the WebUI or IDE to view details.
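Several figures in this sample output can be cross-checked against each other. The sketch below rests on inferred relationships: the report does not state its formulas. The sample values are nonetheless consistent with miss rate = 100% minus hit rate, L3 read hit rate = read hit bandwidth / read bandwidth, and total DDR bandwidth = the sum of the per-node totals. All numbers are copied from the output above; the function names are hypothetical.

```python
# Values copied from the sample report above.
hit_rate = {"L1D": 96.53, "L1I": 99.99, "L2D": 41.12, "L2I": 64.74}
reported_miss = {"L1D": 3.47, "L1I": 0.01, "L2D": 58.88, "L2I": 35.26}

# node -> (read hit bandwidth MB/s, read bandwidth MB/s, reported hit rate %)
l3 = {
    0: (369.95, 21079.22, 1.76),
    1: (10.93, 181.49, 6.02),
    2: (23.75, 296.73, 8.00),
    3: (4.17, 110.28, 3.78),
}

# node -> (total DDR read MB/s, total DDR write MB/s)
ddrc_totals = {0: (16779.55, 616.21), 1: (7.55, 5.24),
               2: (85.35, 27.01), 3: (27.80, 9.56)}


def miss_rate(hit_percent: float) -> float:
    """Assumed relationship: miss rate is the complement of the hit rate."""
    return round(100.0 - hit_percent, 2)


def l3_hit_rate(hit_bw: float, read_bw: float) -> float:
    """Assumed relationship: hit rate is hit bandwidth over read bandwidth."""
    return round(100.0 * hit_bw / read_bw, 2)


for name, hit in hit_rate.items():
    print(name, miss_rate(hit), reported_miss[name])

for node, (hit_bw, read_bw, reported) in l3.items():
    print(node, l3_hit_rate(hit_bw, read_bw), reported)

total_read = sum(read for read, _ in ddrc_totals.values())
total_write = sum(write for _, write in ddrc_totals.values())
print(round(total_read, 2), round(total_write, 2))
```

The summed DDR totals agree with the reported ddrc_read and ddrc_write values only to within rounding of the displayed figures.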
Output report description:
The report consists of six parts. From top to bottom, they are the system information, the average miss rates of the L1 and L2 caches, the total double data rate (DDR) bandwidth, the bandwidths and hit rates of the L1 and L2 caches, the read bandwidth and hit rate of the L3 cache, and the DDR controller (DDRC) bandwidth.
- System information
Displays the Linux kernel version, CPU type, NUMA nodes, and CPU cores row by row.
- Average miss rates of L1 and L2 caches
Displays the average L1D, L1I, L2D, and L2I cache miss rates of the CPU cores row by row.
- Total DDR bandwidth
Displays the total DDR write bandwidth (ddrc_write) and read bandwidth (ddrc_read) row by row.
- L1 and L2 cache bandwidths and hit rates
If you set the -c parameter to specify the CPU cores to be collected, the L1 and L2 cache bandwidths and hit rates of each specified CPU core are displayed. If you do not specify CPU cores, the average L1 and L2 cache bandwidths and hit rates of all CPU cores are displayed by default.
- L3 cache read bandwidth and hit rate
Displays the read hit bandwidth, read bandwidth, and hit rate of the L3 cache on each NUMA node row by row.
- DDRC bandwidth information
Displays the read and write bandwidths of each DDRC. Generally, a NUMA node has four DDRCs.
- Memory access analysis reports a reference DDR read bandwidth bottleneck for Kunpeng 920 servers, so you can tell whether the current DDRC read bandwidth has reached it. Once the bottleneck is reached, the latency between the CPU and DDRC increases significantly. The bottleneck value is tested on servers with the standard DDRC configuration. The reference DDRC rate of the Kunpeng 920 server is 2933 MT/s.
- In the result, DDRC Read Bandwidth Bottleneck is the reference bottleneck value. If any value in the DDRC_ACCESS_BANDWIDTH table exceeds it, a DDRC exceeding bottleneck row is added. Each location is given as the node, the DDRC, and the access type (read or write), for example [Node 0, DDRC_2, DDR READ].
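The bottleneck check described above can be sketched as a simple threshold comparison. This is an illustration, not the tool's implementation. `find_exceeding` is a hypothetical function, the table values are copied from the sample output, and two points are assumptions: that write values are compared against the same reference value, and that a write entry would be labelled DDR WRITE (only DDR READ appears in the sample).

```python
from typing import Dict, List, Tuple

# Reference value printed in the sample report (for reference only).
BOTTLENECK_MB_S = 12500.0

# node -> per-DDRC (read MB/s, write MB/s), copied from the sample output.
table: Dict[int, List[Tuple[float, float]]] = {
    0: [(0.00, 0.00), (0.00, 0.00), (16779.55, 616.21), (0.00, 0.00)],
    1: [(0.00, 0.00), (0.00, 0.00), (0.00, 0.00), (7.55, 5.24)],
    2: [(0.00, 0.00), (0.00, 0.00), (85.35, 27.01), (0.00, 0.00)],
    3: [(0.00, 0.00), (0.00, 0.00), (0.00, 0.00), (27.80, 9.56)],
}


def find_exceeding(bandwidths: Dict[int, List[Tuple[float, float]]],
                   limit: float) -> List[str]:
    """List every node/DDRC/access type whose bandwidth exceeds the limit."""
    exceeding = []
    for node, ddrcs in sorted(bandwidths.items()):
        for index, (read_bw, write_bw) in enumerate(ddrcs):
            if read_bw > limit:
                exceeding.append(f"[Node {node}, DDRC_{index}, DDR READ]")
            if write_bw > limit:
                # Assumed label; the sample output shows only DDR READ.
                exceeding.append(f"[Node {node}, DDRC_{index}, DDR WRITE]")
    return exceeding


print(find_exceeding(table, BOTTLENECK_MB_S))
```

For the sample data, only the read bandwidth of DDRC_2 on node 0 (16779.55 MB/s) is above 12500 MB/s, which matches the DDRC exceeding bottleneck row in the report.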