NUMA Refined Analysis

By analyzing DDR access data, the inter-NUMA access traffic matrix, and related data, you can identify abnormal bandwidth traffic and the processes or threads responsible for it, and then locate performance problems caused by cross-CPU memory access.

Command Function

Collects refined DDR access data, the NUMA access bandwidth matrix, and per-process memory access information by using the Arm Statistical Profiling Extension (SPE).

To collect the refined NUMA data, the server must support Arm SPE collection. For details about how to configure SPE, see Configuring the SPE Environment.

Syntax

devkit tuner numafast [-d <DURATION> | --duration=DURATION] [-i <INTERVAL> | --interval=INTERVAL]

Parameter Description

Table 1 Parameter description

Parameter | Option | Description
-h/--help | - | Optional. Displays help information.
-o/--outpath | - | Optional. Name and output path of the report package. If you specify only a name, the report package is generated in the current directory. This option must be used together with --package. NOTE: You can import tasks for which TAR packages have been generated into the WebUI for visual display. For details, see the task import content in Task Management.
-l/--log-level | 0/1/2/3 | Optional. Log level, which defaults to 1. 0: DEBUG; 1: INFO; 2: WARNING; 3: ERROR.
-d/--duration | - | Optional. Collection duration, in seconds. The value ranges from 2 to 172,800. If this option is not set, collection never ends; press Ctrl+\ to cancel the task, or press Ctrl+C to stop the collection and start analysis.
-i/--interval | - | Optional. Collection interval, in seconds, which defaults to 5. The value ranges from 2 to 30.
-c/--count | - | Optional. Instruction collection interval for SPE, which defaults to 2048. The value ranges from 1 to 4,294,967,295.
-n/--num | - | Optional. Number of top N processes to display, which defaults to 10 and ranges from 1 to 30. Processes are listed in descending order of access traffic. If the number of processes collected using SPE is less than N, the actual number is displayed.
-t/--threads | - | Optional. Number of top N threads to display, which defaults to 5 and ranges from 1 to 10. Threads are listed in descending order of access traffic. The rule for the number of threads displayed is the same as that for processes.
--package | - | Optional. Indicates whether to generate a report data package. If you do not set the package name or path, the numafast-Timestamp.tar package is generated in the current directory by default.
-f/--file | - | Optional. Generates only report data but not a report data package. This option is used with --package.
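The value ranges in Table 1 can be checked before the tool is started. The following is a minimal Python sketch, not part of DevKit: the helper name build_numafast_command and the RANGES table are hypothetical, and only the flags and limits are taken from the table above. It builds the argument list and does not run the command.

```python
import shlex

# Inclusive value ranges copied from Table 1.
RANGES = {
    "-d": (2, 172_800),        # collection duration, seconds
    "-i": (2, 30),             # collection interval, seconds
    "-c": (1, 4_294_967_295),  # SPE instruction collection interval
    "-n": (1, 30),             # top N processes
    "-t": (1, 10),             # top N threads
}


def build_numafast_command(options, package=False):
    """Hypothetical helper: validate options and return the argv list."""
    argv = ["devkit", "tuner", "numafast"]
    for flag, value in options.items():
        if flag not in RANGES:
            raise ValueError(f"unsupported option: {flag}")
        low, high = RANGES[flag]
        if not low <= value <= high:
            raise ValueError(f"{flag} must be in [{low}, {high}], got {value}")
        argv += [flag, str(value)]
    if package:
        argv.append("--package")
    return argv


argv = build_numafast_command(
    {"-d": 10, "-i": 2, "-c": 2048, "-n": 3, "-t": 3}, package=True
)
print(shlex.join(argv))
```

The resulting list can be passed to subprocess.run on a server where DevKit is installed and SPE is configured.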

Example

devkit tuner numafast -d 10 -i 2 -c 2048 -n 3 -t 3 --package
  • In this command, the collection duration is 10 seconds, the sampling interval is 2 seconds, the instruction collection interval for SPE is 2048, the top 3 processes and top 3 threads are displayed, and a report data package is generated in the default path.
  • If the -d parameter is not set, you can press Ctrl+\ to cancel the task or press Ctrl+C to stop the collection. If -d is set, the number of generated reports depends on the -i parameter. For example, -d 10 -i 2 sets a collection duration of 10 seconds and a sampling interval of 2 seconds, so five reports are generated.
  • The example output shows only one report. The actual report result prevails.
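The relationship between -d, -i, and the number of reports can be sketched as follows. This is an illustration only: the function name expected_report_count is hypothetical, and the use of floor division for durations that are not a multiple of the interval is an assumption that the source does not confirm.

```python
def expected_report_count(duration_s, interval_s):
    """Assumed rule: one report per full sampling interval."""
    if duration_s <= 0 or interval_s <= 0:
        raise ValueError("duration and interval must be positive")
    # Floor division is an assumption for non-divisible durations.
    return duration_s // interval_s


print(expected_report_count(10, 2))  # 5, matching the -d 10 -i 2 example
```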

Command output:

Enter analyze mode, please wait 5 seconds...

NUMAFAST ANALYSIS(Press Ctrl+C to exit)
==========================================================================================
1. System's numa score : 0.88
   Note: score = (max cost - real cost) / (max cost - min cost)
         real cost = SUM(0<=i,j<node number) numa distance(i, j) * access percentage(i, j)
         max cost = MAX(numa distance) , min cost = MIN(numa distance).
         This score is best at 1 and worst at 0.
         Format: traffic | numa distance | access percentage.

              DST_0               DST_1               DST_2               DST_3
SRC_0   0.63GB|10|38.24%    1.03GB|12|23.53%    0.10GB|20|8.82%     0.12GB|22|2.94%
SRC_1   0.00GB|12|0.00%     1.16GB|10|26.47%    0.00GB|22|0.00%     0.00GB|24|0.00%
SRC_2   0.00GB|20|0.00%     0.00GB|22|0.00%     0.00GB|10|0.00%     0.00GB|12|0.00%
SRC_3   0.00GB|22|0.00%     0.00GB|24|0.00%     0.00GB|12|0.00%     0.00GB|10|0.00%

==========================================================================================
2. Node detail information of memory access traffic:
   Note:RMA(Die): Access traffic across NUMA dies.
        RMA(Socket): Access traffic across NUMA sockets.
        LMA: Local access traffic on the NUMA node.
        %CPU: Number of occupied CPU cores. For example, 600% indicates that 6 CPU cores
        are occupied.

 NID RMA(Die) RMA(Skt)      LMA  %RMA MEM(all) MEM(free) %MEM   %CPU
   0   1.03GB   0.22GB   0.63GB  66.5  63.21GB    0.27GB 99.6  137.1
   1   0.00GB   0.00GB   1.16GB   0.0  63.93GB    0.77GB 98.8   93.2
   2   0.00GB   0.00GB   0.00GB   0.0  63.93GB   10.69GB 83.3   96.4
   3   0.00GB   0.00GB   0.00GB   0.0  62.93GB   48.75GB 22.5  109.6

==========================================================================================
3. Show top 3 processes and top 3 threads which sorted by memory access:
   Note:
         If the collected processes less than the number specified by -n (--num), only
         the actual processes are displayed.
         ACCESS: Percentage of the process access traffic to the total traffic. Top N
         sorting is based on this. Threads are the same.
         MIGRATED X|Y: X indicates how many times threads of the process are migrated between
         NUMA nodes, and Y indicates the number of threads in the process.
         %CPU: The meaning is the same as that of the node data, but the data of the first
         report is not included.

 PID(TID)  SCORE  ACCESS  RMA_Die  RMA_Skt      LMA    %RMA  MIGRATED    %CPU    COMMAND
 159706     0.98  86.19%   0.00GB   0.02GB   0.69GB    2.30    0|1       5.00    gunicorn
└─201754  0.98  86.19%   0.00GB   0.02GB   0.69GB    2.30    -|-       1.00    gunicorn

 200956     0.76  11.05%   0.04GB   0.02GB   0.03GB   68.12    0|2       0.50    gunicorn
├─254073  0.73   8.14%   0.02GB   0.02GB   0.03GB   56.71    -|-       0.00    gunicorn
└─254062  0.86   2.91%   0.02GB   0.00GB   0.00GB  100.00    -|-       0.00    gunicorn

 159681     0.86   1.98%   0.02GB   0.00GB   0.00GB  100.00    0|1       0.00    gunicorn
└─159681  0.86   1.98%   0.02GB   0.00GB   0.00GB  100.00    -|-       0.00    gunicorn

==========================================================================================
The report /root/numafast-20241121-191339.tar is generated successfully
To view the summary report, you can run: devkit report -i /root/numafast-20241121-191339.tar
To view the detail report, you can import the report to WebUI or IDE

Output report description:

The report consists of three parts: memory access matrix information, node details of memory access traffic, and process information sorted by memory access.
  1. Memory access matrix information

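    Each cell consists of three parts, in the format traffic | NUMA distance | access percentage: the bandwidth traffic from SRC to DST, the NUMA distance between SRC and DST, and the proportion of the traffic from SRC to DST to the total traffic. The system NUMA score is calculated from the NUMA distances and access percentages and ranges from 0 (worst) to 1 (best).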
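The score formula printed in the report note can be reproduced from the matrix. The sketch below is a plain Python reading of that note, using the NUMA distances and access percentages of the sample report; the function name numa_score and the handling of a system with a single distance value are assumptions, not DevKit behavior.

```python
# Rows are SRC nodes, columns are DST nodes (values from the sample report).
distance = [
    [10, 12, 20, 22],
    [12, 10, 22, 24],
    [20, 22, 10, 12],
    [22, 24, 12, 10],
]
access_pct = [
    [38.24, 23.53, 8.82, 2.94],
    [0.00, 26.47, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.00],
]


def numa_score(distance, access_pct):
    """score = (max cost - real cost) / (max cost - min cost)."""
    # real cost = sum over (i, j) of distance(i, j) * access percentage(i, j)
    real_cost = sum(
        d * p / 100.0
        for d_row, p_row in zip(distance, access_pct)
        for d, p in zip(d_row, p_row)
    )
    all_distances = [d for row in distance for d in row]
    max_cost, min_cost = max(all_distances), min(all_distances)
    if max_cost == min_cost:
        # Assumption: with a single distance value, every access is local.
        return 1.0
    return (max_cost - real_cost) / (max_cost - min_cost)


score = numa_score(distance, access_pct)
print(f"{score:.2f}")  # 0.88, matching the sample report
```

With the sample data, the real cost is about 11.71 against a minimum of 10 and a maximum of 24, which is why the score stays close to 1.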

  2. Node details of memory access traffic
    Table 2 Parameters of node details

    Parameter | Description
    NID | NUMA node ID.
    RMA(Die) | Access traffic across NUMA dies.
    RMA(Skt) | Access traffic across NUMA sockets.
    LMA | Local access traffic on the NUMA node.
    %RMA | Percentage of remote access traffic.
    MEM(all) | Total memory size.
    MEM(free) | Available memory size.
    %MEM | Memory usage.
    %CPU | Number of occupied CPU cores. For example, 600% indicates that 6 CPU cores are occupied.
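The source does not state how %RMA and %MEM are derived. The sketch below assumes that %RMA is the remote share of all node traffic and that %MEM is the used share of total memory; both formulas are inferred from the sample report, where they reproduce the values shown for node 0, and the function names are hypothetical.

```python
def rma_percent(rma_die_gb, rma_skt_gb, lma_gb):
    """Assumed formula: remote traffic / (remote + local traffic)."""
    total = rma_die_gb + rma_skt_gb + lma_gb
    if total == 0:
        return 0.0  # no traffic recorded on this node
    return 100.0 * (rma_die_gb + rma_skt_gb) / total


def mem_percent(mem_all_gb, mem_free_gb):
    """Assumed formula: (total - free) / total."""
    return 100.0 * (mem_all_gb - mem_free_gb) / mem_all_gb


# Node 0 of the sample report: RMA(Die)=1.03GB, RMA(Skt)=0.22GB, LMA=0.63GB.
print(round(rma_percent(1.03, 0.22, 0.63), 1))  # 66.5
print(round(mem_percent(63.21, 0.27), 1))       # 99.6
```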

  3. Top N processes sorted by memory access
    Table 3 Parameters of process information

    Parameter | Description
    PID(TID) | Process/Thread ID.
    SCORE | NUMA score.
    ACCESS | Percentage of the process/thread access traffic to the total traffic (determines the top N sorting).
    RMA_Die | Access traffic across NUMA dies.
    RMA_Skt | Access traffic across NUMA sockets.
    LMA | Local access traffic on the NUMA node.
    %RMA | Percentage of remote access traffic.
    MIGRATED | Displayed as X|Y. X is the number of times that threads of the process are migrated between NUMA nodes, and Y is the number of threads in the process.
    %CPU | CPU usage. The meaning is the same as that of the node data, but the data of the first report is not included.
    COMMAND | Command line of a process/thread.
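For scripted post-processing, a row of the process table can be split into the fields described in Table 3. This is a rough sketch that assumes the whitespace-separated layout of the sample report; the function name parse_process_row and the dictionary keys are hypothetical, and the actual report layout may differ between versions.

```python
def parse_process_row(line):
    """Parse one process or thread row of the sample report layout."""
    text = line.strip()
    # Thread rows are prefixed with tree-drawing characters.
    is_thread = text.startswith(("├", "└"))
    fields = text.lstrip("├└─").split(None, 9)
    if len(fields) != 10:
        raise ValueError(f"unexpected row layout: {line!r}")
    (ident, score, access, rma_die, rma_skt,
     lma, rma_pct, migrated, cpu, command) = fields
    times, _, threads = migrated.partition("|")
    return {
        "id": int(ident),
        "is_thread": is_thread,
        "score": float(score),
        "access_pct": float(access.rstrip("%")),
        "rma_die_gb": float(rma_die.removesuffix("GB")),
        "rma_skt_gb": float(rma_skt.removesuffix("GB")),
        "lma_gb": float(lma.removesuffix("GB")),
        "rma_pct": float(rma_pct),
        # Thread rows show "-|-", which is mapped to None here.
        "migrated": None if times == "-" else (int(times), int(threads)),
        "cpu_pct": float(cpu),
        "command": command,
    }


process = parse_process_row(
    " 200956     0.76  11.05%   0.04GB   0.02GB   0.03GB   68.12    0|2       0.50    gunicorn"
)
thread = parse_process_row(
    "├─254073  0.73   8.14%   0.02GB   0.02GB   0.03GB   56.71    -|-       0.00    gunicorn"
)
print(process["id"], process["migrated"], thread["is_thread"])
```

The sketch relies on str.removesuffix, which requires Python 3.9 or later.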