
Memory Access Statistics Analysis

The memory access unit is the most complex logic control unit in a CPU. It handles the various issues that arise while executing memory access instructions, such as Load and Store, and ensures that they execute at high speed. Memory access statistics analysis helps you identify processes that may cause performance problems.

Command Function

Accesses the cache and memory PMU events and analyzes memory access counts, hit rates, and bandwidth.

Syntax

devkit tuner memory [-h] [-d <sec>] [-l {0, 1, 2, 3}] [-i <sec>] [-m {1, 2, 3, 4}] [-P {100, 1000}] [-c {n,m | n-m}]

Parameter Description

Table 1 Parameter description

Parameter

Option

Description

-h/--help

-

Obtains help information. This parameter is optional.

-d/--duration

-

Collection duration, in seconds. The minimum value is 1 second. By default, the collection does not stop automatically. You can press Ctrl+\ to cancel the task or press Ctrl+C to stop the collection and start the analysis. This parameter is optional.

-l/--log-level

0/1/2/3

Log level, which defaults to 1. This parameter is optional.
  • 0: DEBUG
  • 1: INFO
  • 2: WARNING
  • 3: ERROR

-i/--interval

-

Task collection interval, in seconds, that is, the time taken to collect data for each subreport. The minimum value is 1 second, and the maximum value cannot exceed the collection duration. The default value is the collection duration. If this parameter is not set, no subreports are generated. This parameter is optional.

-m/--metric

1/2/3/4

Sampling type, which defaults to 1. This parameter is optional.

  • 1 (ALL)
  • 2 (Cache)
  • 3 (DDR)
  • 4 (HBM)
    NOTE:

    The 4 (HBM) option displays the HBM bandwidth information. This option is available in openEuler 22.03 SP3 or later and requires hardware support.

-P/--period

100/1000

Actual data collection interval, which defaults to 1000 ms. The options are 100 ms and 1000 ms. When the collection duration (-d) is set to 1 second, the default value automatically changes to 100 ms. This parameter is optional.

-c/--cpu

-

IDs of the CPU cores to be collected. The value can be a single core ID (for example, 0), a comma-separated list (for example, 0,1,2), or a range (for example, 0-2). This parameter is optional. By default, all CPU cores are collected.
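The -c/--cpu value accepts either a comma-separated list (n,m) or a range (n-m). The following is an illustrative sketch of how such a value expands into individual core IDs; parse_cpu_list is a hypothetical helper for explanation only, not part of the devkit tool.

```python
def parse_cpu_list(spec):
    """Expand a -c/--cpu style value such as "0", "0,1,2", or "0-2"
    into a sorted list of core IDs (illustrative helper only)."""
    cores = set()
    for part in spec.split(","):
        if "-" in part:
            # A range n-m covers both endpoints.
            start, end = part.split("-")
            cores.update(range(int(start), int(end) + 1))
        else:
            cores.add(int(part))
    return sorted(cores)

print(parse_cpu_list("0-2"))    # [0, 1, 2]
print(parse_cpu_list("0,1,2"))  # [0, 1, 2]
```

Both forms above select cores 0 through 2, matching the examples given in the parameter description.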

Example

devkit tuner memory -d 2 -m 1

In this command, -d 2 indicates that the collection duration is 2 seconds, and -m 1 indicates that all cache access, DDR access, and HBM bandwidth information is collected. (HBM bandwidth information is displayed only when the environment supports HBM collection.)

Command output:

Memory Summary Report-ALL                               Time:2025/12/09 09:57:23
================================================================================

System Information
───────────────────────────────────────────────────────────────
Linux Kernel Version        4.19.25-203.el7.bclinux.aarch64
Cpu Type                    Kunpeng 920
NUMA NODE(cpus)             0(0-31)      1(32-63)     2(64-95)     3(96-127)

Percentage of core Cache miss
───────────────────────────────────────────────────────────────
L1D         3.47%
L1I         0.01%
L2D        58.88%
L2I        35.26%


DDR Bandwidth (system wide)
───────────────────────────────────────────────────────────────
ddrc_write        658.03MB/s
ddrc_read         16900.26MB/s


Memory metrics of the Cache
───────────────────────────────────────────────────────────────
1. L1/L2/TLB Access Bandwidth and Hit Rate
Value Format: X|Y = Bandwidth | Hit Rate
───────────────────────────────────────────────────────────────
  CPU                   L1D                    L1I                   L2D                  L2I       L2D_TLB       L2I_TLB
───────────────────────────────────────────────────────────────
  all    81581.38MB/s|96.53%    201888.73MB/s|99.99%    35588.65MB/s|41.12%    72.89MB/s|64.74%    N/A|57.10%    N/A|94.37%
───────────────────────────────────────────────────────────────


2. L3 Read Bandwidth and Hit Rate
───────────────────────────────────────────────────────────────
  NODE    CCL     Read Hit Bandwidth    Read Bandwidth    Read Hit Rate
───────────────────────────────────────────────────────────────
  0               369.95MB/s      21079.22MB/s            1.76%
  1                10.93MB/s        181.49MB/s            6.02%
  2                23.75MB/s        296.73MB/s            8.00%
  3                 4.17MB/s        110.28MB/s            3.78%
──────────────────────────────────────────────────────────────────


Memory metrics of the DDRC
──────────────────────────────────────────────────────────────────
1. DDRC_ACCESS_BANDWIDTH
Value Format: X|Y = DDR read | DDR write
DDRC Read Bandwidth Bottleneck: 12500MB/s (for reference only)
Exceeding the bottleneck will significantly increase latency.
Please refer to README_ZH.md(Chapter 6.7) for specific bottleneck testing configurations.
DDRC exceeding bottleneck: [Node 0, DDRC_2, DDR READ]
──────────────────────────────────────────────────────────────────
  NODE                   DDRC_0                   DDRC_1                   DDRC_2                   DDRC_3                    Total
──────────────────────────────────────────────────────────────────
  0       0.00MB/s|0.00MB/s    0.00MB/s|0.00MB/s    16779.55MB/s|616.21MB/s     0.00MB/s|0.00MB/s    16779.55MB/s|616.21MB/s
  1       0.00MB/s|0.00MB/s    0.00MB/s|0.00MB/s        0.00MB/s|  0.00MB/s     7.55MB/s|5.24MB/s        7.55MB/s|  5.24MB/s
  2       0.00MB/s|0.00MB/s    0.00MB/s|0.00MB/s       85.35MB/s| 27.01MB/s     0.00MB/s|0.00MB/s       85.35MB/s| 27.01MB/s
  3       0.00MB/s|0.00MB/s    0.00MB/s|0.00MB/s        0.00MB/s|  0.00MB/s    27.80MB/s|9.56MB/s       27.80MB/s|  9.56MB/s
──────────────────────────────────────────────────────────────────
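The percentages in the report can be cross-checked against its own bandwidth figures. As a sketch, using values taken from the sample output above:

```python
# L3 read hit rate on NUMA node 0: read hit bandwidth / read bandwidth.
read_hit_bw = 369.95   # MB/s, from the L3 table above
read_bw = 21079.22     # MB/s, from the L3 table above
l3_hit_rate = round(read_hit_bw / read_bw * 100, 2)
print(l3_hit_rate)     # 1.76, matching the Read Hit Rate column

# Miss rate and hit rate are complementary: the 3.47% L1D miss rate in
# the summary corresponds to the 96.53% L1D hit rate in the per-core
# table (miss rate = 100% - hit rate).
l1d_miss_rate = round(100.0 - 96.53, 2)
print(l1d_miss_rate)   # 3.47
```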

Output report description:

The report consists of six parts. From top to bottom are the system information, the average miss rates of L1 and L2 caches, the total double data rate (DDR) bandwidth, the bandwidths and hit rates of L1 and L2 caches, the read bandwidth and hit rate of the L3 cache, and the DDR controller (DDRC) bandwidth.

  1. System information

    Displays the Linux kernel version, CPU type, NUMA nodes, and CPU cores row by row.

  2. Average miss rates of L1 and L2 caches

    Displays the average L1D, L1I, L2D, and L2I cache miss rates of the CPU cores, that is, the ratio of the number of cache misses to the total number of access times.

  3. Total DDR bandwidth

    Displays the DDRC read and write bandwidths row by row.

  4. L1 and L2 cache bandwidths and hit rates

    If you set the -c parameter to specify the CPU cores to be collected, the L1 and L2 cache bandwidths and hit rates of each specified CPU core are displayed. If you do not specify CPU cores, the average L1 and L2 cache bandwidths and hit rates across all CPU cores are displayed by default.

  5. L3 cache read bandwidth and hit rate

    Displays the read hit bandwidth, read bandwidth, and hit rate of the L3 cache on each NUMA node row by row.

  6. DDRC bandwidth information

    Displays the read and write bandwidths of each DDRC. Generally, a NUMA node has four DDRCs.

    • Memory access analysis provides reference DDR read bandwidth bottleneck values for Kunpeng 920 servers, so that you can determine whether the current DDRC read bandwidth has reached the bottleneck. If it has, the latency between the CPU and the DDRC increases significantly. The bottleneck of each server model is measured under the standard configuration. The reference DDRC rate of the Kunpeng 920 server is 2933 MT/s.
    • In the result, DDRC Read Bandwidth Bottleneck is the reference bottleneck value. If any value in the DDRC_ACCESS_BANDWIDTH table exceeds it, a DDRC exceeding bottleneck row is added, identifying the location by NUMA node, DDRC, and access type, for example, [Node 0, DDRC_2, DDR READ].
    • When the one numa per socket option is enabled in the BIOS, each NUMA node has eight DDRC channels. In this case, each DDRC bandwidth displayed in the result report is the combined bandwidth of two DDRCs, one from each of the two CPU dies in the socket (each CPU socket has two CPU dies).
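The bottleneck check described above can be sketched as follows. This is an illustration of the comparison, not the tool's actual implementation; the 12500 MB/s reference value and the nonzero bandwidth entries are taken from the sample report.

```python
# Reference read bandwidth bottleneck from the sample report (MB/s).
BOTTLENECK_MB_S = 12500.0

# {node: {ddrc: (read_MB_s, write_MB_s)}} -- nonzero entries from the
# sample DDRC_ACCESS_BANDWIDTH table above.
bandwidth = {
    0: {"DDRC_2": (16779.55, 616.21)},
    1: {"DDRC_3": (7.55, 5.24)},
    2: {"DDRC_2": (85.35, 27.01)},
    3: {"DDRC_3": (27.80, 9.56)},
}

# Flag every value that exceeds the reference bottleneck, identified
# by NUMA node, DDRC, and access type.
exceeding = []
for node, ddrcs in bandwidth.items():
    for ddrc, (read_bw, write_bw) in ddrcs.items():
        if read_bw > BOTTLENECK_MB_S:
            exceeding.append((f"Node {node}", ddrc, "DDR READ"))
        if write_bw > BOTTLENECK_MB_S:
            exceeding.append((f"Node {node}", ddrc, "DDR WRITE"))

print(exceeding)  # [('Node 0', 'DDRC_2', 'DDR READ')], as in the report
```

Only the node 0 DDRC_2 read bandwidth (16779.55 MB/s) exceeds the 12500 MB/s reference value, which matches the DDRC exceeding bottleneck line in the sample output.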