
Memory Access Statistics Analysis

Command Function

Accesses the cache and memory PMU events and analyzes memory access counts, hit rates, and bandwidth.

Syntax

devkit tuner memory [-h] [-d <sec>] [-l {0, 1, 2, 3}] [-i <sec>] [-o] [-m {1, 2, 3, 4}] [-P {100, 1000}] [-c {n,m | n-m}] [--package]

Parameter Description

Table 1 Parameter description

Parameter

Option

Description

-h/--help

-

Obtains help information.

-d/--duration

-

Collection duration, in seconds. The minimum value is 1 second. By default, collection never ends. You can press Ctrl+\ to cancel the task or press Ctrl+C to stop the collection and start the analysis.

-l/--log-level

0/1/2/3

Log level, which defaults to 1.
  • 0: DEBUG
  • 1: INFO
  • 2: WARNING
  • 3: ERROR

-i/--interval

-

Collection interval, in seconds, that is, the time taken to collect the data of each subreport. The minimum value is 1 second and the value cannot exceed the collection duration. If this parameter is not set, it defaults to the collection duration and no subreports are generated.

-m/--metric

1/2/3/4

Sampling type, which defaults to 1.

  • 1 (ALL)
  • 2 (Cache)
  • 3 (DDR)
  • 4 (HBM)
    NOTE:

    The 4 (HBM) option displays HBM bandwidth information. It is available on openEuler 22.03 SP3 or later and requires a running environment with HBM support.

-o/--output

-

Report package name and output path. If you enter a name only, the report package is generated in the current directory by default. This option must be used together with --package.

-c/--cpu

-

CPU cores to be collected. The value can be a single core (for example, 0), a comma-separated list (for example, 0,1,2), or a range (for example, 0-2). By default, all CPU cores are collected.

-P/--period

100/1000

Sampling period, which defaults to 1000 ms. The options are 1000 ms and 100 ms. When the collection duration is set to 1 second, the default value automatically changes to 100 ms.

--package

-

Indicates whether to generate a report data package. If you do not set the package name or path, the memory-timestamp.tar package is generated in the current directory by default.
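
Combining several of the options above, a typical invocation might look like the following sketch. The flag values and output path are illustrative, and the call is guarded with `command -v` so the snippet is safe to run on hosts where the DevKit CLI is not installed.

```shell
# Illustrative: collect cache metrics (-m 2) on cores 0-3 for 10 s,
# emitting a subreport every 2 s (-i 2), and package the report in /tmp.
if command -v devkit >/dev/null 2>&1; then
    devkit tuner memory -d 10 -i 2 -m 2 -c 0-3 -o /tmp/mem_report --package
else
    echo "devkit CLI not found; skipping collection"
fi
```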

Example

devkit tuner memory -d 2 -o /home/memory_result -m 1 --package
  • The -d 2 parameter sets the collection duration to 2 seconds. The -o /home/memory_result and --package parameters generate a report data package named memory_result.tar in the /home directory. The -m 1 parameter collects all cache access data, DDR access data, and HBM bandwidth information. (HBM bandwidth information is collected only when the environment supports this function.)
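
Once the package exists, its summary can be viewed with the devkit report command printed at the end of the collection output. A guarded sketch, using the package path from the example above:

```shell
# View the summary report from the generated package.
# Guarded so the sketch is safe on hosts without the DevKit CLI installed.
if command -v devkit >/dev/null 2>&1; then
    devkit report -i /home/memory_result.tar
else
    echo "devkit CLI not found"
fi
```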

Command output:

Memory Summary Report-ALL                               Time:2024/07/22 15:30:16
================================================================================

System Information
────────────────────────────────────────────────────────────────────
Linux Kernel Version        4.19.25-203.el7.bclinux.aarch64
Cpu Type                    Kunpeng 920
NUMA NODE(cpus)             0(0-31)      1(32-63)     2(64-95)     3(96-127)


Percentage of core Cache miss
────────────────────────────────────────────────────────────────────
L1D         3.47%
L1I         0.01%
L2D        58.88%
L2I        35.26%


DDR Bandwidth
────────────────────────────────────────────────────────────────────
ddrc_write          658.03MB/s
ddrc_read         16900.26MB/s


Memory metrics of the Cache
────────────────────────────────────────────────────────────────────
1. L1/L2/TLB Access Bandwidth and Hit Rate
Value Format: X|Y = Bandwidth | Hit Rate
────────────────────────────────────────────────────────────────────
  CPU                    L1D                     L1I                    L2D                 L2I       L2D_TLB       L2I_TLB
────────────────────────────────────────────────────────────────────
  all    81581.38MB/s|96.53%    201888.73MB/s|99.99%    35588.65MB/s|41.12%    72.89MB/s|64.74%    N/A|57.10%    N/A|94.37%
────────────────────────────────────────────────────────────────────
2. L3 Read Bandwidth and Hit Rate
─────────────────────────────────────────────────────────────────
  NODE    Read Hit Bandwidth    Read Bandwidth    Read Hit Rate
─────────────────────────────────────────────────────────────────
  0               369.95MB/s      21079.22MB/s            1.76%
  1                10.93MB/s        181.49MB/s            6.02%
  2                23.75MB/s        296.73MB/s            8.00%
  3                 4.17MB/s        110.28MB/s            3.78%
─────────────────────────────────────────────────────────────────

Memory metrics of the DDRC
────────────────────────────────────────────────────────────────────
1. DDRC_ACCESS_BANDWIDTH
Value Format: X|Y = DDR read | DDR write
DDRC Read Bandwidth Bottleneck: 12500MB/s (for reference only)
Exceeding the bottleneck will significantly increase latency.
Please refer to README_ZH.md(Chapter 6.7) for specific bottleneck testing configurations.
DDRC exceeding bottleneck: [Node 0, DDRC_2, DDR READ]
────────────────────────────────────────────────────────────────────
  NODE               DDRC_0               DDRC_1                     DDRC_2                DDRC_3                      Total
────────────────────────────────────────────────────────────────────
  0       0.00MB/s|0.00MB/s    0.00MB/s|0.00MB/s    16779.55MB/s|616.21MB/s     0.00MB/s|0.00MB/s    16779.55MB/s|616.21MB/s
  1       0.00MB/s|0.00MB/s    0.00MB/s|0.00MB/s        0.00MB/s|  0.00MB/s     7.55MB/s|5.24MB/s        7.55MB/s|  5.24MB/s
  2       0.00MB/s|0.00MB/s    0.00MB/s|0.00MB/s       85.35MB/s| 27.01MB/s     0.00MB/s|0.00MB/s       85.35MB/s| 27.01MB/s
  3       0.00MB/s|0.00MB/s    0.00MB/s|0.00MB/s        0.00MB/s|  0.00MB/s    27.80MB/s|9.56MB/s       27.80MB/s|  9.56MB/s
────────────────────────────────────────────────────────────────────
The report /home/memory_result.tar is generated successfully.
To view summary report. you can run: devkit report -i /home/memory_result.tar
To view detail report. you can import the report to the WebUI or IDE to view details.
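
As a sanity check on the sample output above, the per-node totals in the DDRC_ACCESS_BANDWIDTH table should add up (within rounding) to the DDR Bandwidth figures in the summary. Summing the four node totals from the sample report:

```shell
# Sum the per-node DDRC totals from the sample report above.
# The read sum matches ddrc_read (16900.26 MB/s) and the write sum matches
# ddrc_write (658.03 MB/s) within rounding.
awk 'BEGIN {
    printf "read total:  %.2f MB/s\n", 16779.55 + 7.55 + 85.35 + 27.80
    printf "write total: %.2f MB/s\n", 616.21 + 5.24 + 27.01 + 9.56
}'
```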

Output report description:

The report consists of six parts. From top to bottom, they are the system information, the average miss rates of the L1 and L2 caches, the total double data rate (DDR) bandwidth, the bandwidths and hit rates of the L1 and L2 caches, the read bandwidth and hit rate of the L3 cache, and the DDR controller (DDRC) bandwidth.

  1. System information

    Displays the Linux kernel version, CPU type, NUMA nodes, and CPU cores row by row.

  2. Average miss rates of L1 and L2 caches

    Displays the average L1D, L1I, L2D, and L2I cache miss rates across the CPU cores row by row.

  3. Total DDR bandwidth

    Displays the DDRC read and write bandwidths row by row.

  4. L1 and L2 cache bandwidths and hit rates

    If you set the -c parameter to specify the CPU cores to be collected, the L1 and L2 cache bandwidths and hit rates of each specified CPU core are displayed. If you do not specify CPU cores, the average L1 and L2 cache bandwidths and hit rates of all CPU cores are displayed by default.

  5. L3 cache read bandwidth and hit rate

    Displays the read hit bandwidth, read bandwidth, and hit rate of the L3 cache on each NUMA node row by row.

  6. DDRC bandwidth information

    Displays the read and write bandwidths of each DDRC. Generally, a NUMA node has four DDRCs.

    • Memory access analysis provides a set of reference DDR read bandwidth bottleneck values for Kunpeng 920 servers, so you can check whether the current DDRC read bandwidth has reached the bottleneck. If it has, the latency between the CPU and the DDRC increases significantly. The bottleneck values are tested on the standard configuration of each server model; the reference DDRC rate of the Kunpeng 920 server is 2933 MT/s.
    • In the result, DDRC Read Bandwidth Bottleneck is the reference bandwidth bottleneck value. If any value in the DDRC_ACCESS_BANDWIDTH table exceeds the bottleneck, a DDRC exceeding bottleneck row is added, which identifies the data location in the [Node n, DDRC_n, DDR READ/WRITE] format.
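
To put the 12500 MB/s reference bottleneck in context: assuming a standard 64-bit (8-byte) DDR data bus (an assumption about the standard configuration, not stated in this document), a 2933 MT/s channel has a theoretical peak of 2933 × 8 = 23464 MB/s, so the reference bottleneck sits at roughly half of the theoretical peak:

```shell
# Theoretical per-channel peak for DDR-2933 with an assumed 8-byte bus width.
awk 'BEGIN {
    peak = 2933 * 8                        # MB/s = MT/s x bytes per transfer
    printf "theoretical peak: %d MB/s\n", peak
    printf "bottleneck/peak:  %.0f%%\n", 12500 / peak * 100
}'
```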