NUMA Refined Analysis
Command Function
Obtains refined DDR access data.
- To collect the refined NUMA data, the server must support Arm SPE collection. For details about how to configure SPE, see Configuring the SPE Environment.
- You can import the tasks for which TAR packages have been generated to the WebUI for visual display. For details, see the task import content in Task Management.
Syntax
devkit tuner numafast [-d <DURATION> | --duration=DURATION] [-i <INTERVAL> | --interval=INTERVAL]
Parameter Description
| Parameter | Option | Description |
|---|---|---|
| -h/--help | - | Obtains help information. |
| -o/--outpath | - | Report file name. Reports are generated in the current directory. |
| -l/--log-level | 0/1/2/3 | Log level, which defaults to 1. |
| -d/--duration | - | Collection duration, in seconds. The value ranges from 2 to 172,800 seconds. Collection never ends by default. You can press Ctrl+\ to cancel the task or press Ctrl+C to stop the collection and start analysis. |
| -i/--interval | - | Collection interval, which defaults to 5 seconds. The value ranges from 2 to 30 seconds. |
| -c/--count | - | Instruction collection interval for SPE, which defaults to 2048. The value ranges from 1 to 4,294,967,295. |
| -n/--num | - | Number of top N processes to be displayed, which defaults to 10. The value ranges from 1 to 30. |
| --package | - | Indicates whether to import data to the database and generate compressed packages in the specified output path. |
| -f/--file | - | Generates only report files but not report data packages. This parameter is used with --package. |
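The value ranges in the table can be checked before the tool is launched. The following is a minimal sketch, not part of devkit: the helper name build_numafast_argv and the RANGES dictionary are hypothetical, and only the flags and ranges listed in the table above are used.

```python
# Hypothetical helper, not shipped with devkit.
# The ranges are copied from the parameter table above.
RANGES = {
    "duration": (2, 172_800),      # -d, seconds
    "interval": (2, 30),           # -i, seconds
    "count": (1, 4_294_967_295),   # -c, SPE instruction collection interval
    "num": (1, 30),                # -n, top N processes
}


def build_numafast_argv(duration=None, interval=5, count=2048, num=10, package=False):
    """Validate the values against the documented ranges and build the argument list."""
    values = {"duration": duration, "interval": interval, "count": count, "num": num}
    for name, value in values.items():
        if value is None:
            continue  # -d is optional: collection never ends by default
        low, high = RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside the documented range {low}..{high}")

    argv = ["devkit", "tuner", "numafast",
            "-i", str(interval), "-c", str(count), "-n", str(num)]
    if duration is not None:
        argv += ["-d", str(duration)]
    if package:
        argv.append("--package")
    return argv


if __name__ == "__main__":
    print(" ".join(build_numafast_argv(interval=2, count=2048, num=3, package=True)))
```

The returned list can be passed to subprocess.run on a server where devkit is installed; the sketch itself only builds and prints the command.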
Example
Command:
devkit tuner numafast -i 2 -c 2048 -n 3 --package
- In this command, the sampling interval is 2 seconds, the instruction collection interval for SPE is 2048, the top 3 processes are displayed, and a report data package is generated in the default path.
- If the -d parameter is not set, you can press Ctrl+\ to cancel the task or press Ctrl+C to stop the collection. If -d is set, the number of generated reports depends on the -i parameter. For example, -d 10 -i 2 sets a collection duration of 10 seconds and a sampling interval of 2 seconds, so five reports are generated.
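The relationship between -d, -i, and the number of reports can be sketched as a one-line calculation. The function name below is hypothetical, and the handling of a duration that is not a multiple of the interval is an assumption: the source only gives the -d 10 -i 2 case.

```python
def expected_report_count(duration_s, interval_s):
    """Estimate how many reports a run produces.

    Assumption: one report per full sampling interval. The behavior for a
    trailing partial interval is not documented, so it is ignored here.
    """
    return duration_s // interval_s


if __name__ == "__main__":
    print(expected_report_count(10, 2))  # the -d 10 -i 2 case from the text
```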
Command output:
Enter analyze mode, please wait 5 seconds...
NUMAFAST ANALYSIS(Press Ctrl+C to exit)
==========================================================================================
1. System's numa score : 0.88
Note: score = (max cost - real cost) / (max cost - min cost)
real cost = SUM(0<=i,j<node number) numa distance(i, j) * access percentage(i, j)
max cost = MAX(numa distance) , min cost = MIN(numa distance).
This score is best at 1 and worst at 0.
Format: traffic | numa distance | access percentage.
DST_0 DST_1 DST_2 DST_3
SRC_0 0.63GB|10|38.24% 1.03GB|12|23.53% 0.10GB|20|8.82% 0.12GB|22|2.94%
SRC_1 0.00GB|12|0.00% 1.16GB|10|26.47% 0.00GB|22|0.00% 0.00GB|24|0.00%
SRC_2 0.00GB|20|0.00% 0.00GB|22|0.00% 0.00GB|10|0.00% 0.00GB|12|0.00%
SRC_3 0.00GB|22|0.00% 0.00GB|24|0.00% 0.00GB|12|0.00% 0.00GB|10|0.00%
==========================================================================================
2. Node detail information of memory access traffic:
Note:RMA(Die): Access traffic across NUMA dies.
RMA(Socket): Access traffic across NUMA sockets.
LMA: Local access traffic on the NUMA node.
%CPU: Number of occupied CPU cores. For example, 600% indicates that 6 CPU cores
are occupied.
NID RMA(Die) RMA(Skt) LMA %RMA MEM(all) MEM(free) %MEM %CPU
0 1.03GB 0.22GB 0.63GB 66.5 63.21GB 0.27GB 99.6 137.1
1 0.00GB 0.00GB 1.16GB 0.0 63.93GB 0.77GB 98.8 93.2
2 0.00GB 0.00GB 0.00GB 0.0 63.93GB 10.69GB 83.3 96.4
3 0.00GB 0.00GB 0.00GB 0.0 62.93GB 48.75GB 22.5 109.6
==========================================================================================
3. Show top 3 process which sorted by memory access:
Note:If the collected processes less than the number specified by -n (--num), only the actual processes are displayed.
MIGRATED X|Y: X indicates how many times threads of the process are migrated between
NUMA nodes, and Y indicates the number of threads in the process.
ACCESS: Percentage of the process access traffic to the total traffic.Top N
sorting is based on this.
PID SCORE ACCESS RMA(Die) RMA(Skt) LMA %RMA MIGRATED %CPU COMMAND
2083500 1.00 38.24% 0.00GB 0.00GB 0.63GB 0.0 0|2 nan test_thread
3296840 0.97 32.35% 0.26GB 0.00GB 1.16GB 18.2 0|2 nan python
3784179 0.61 29.41% 0.77GB 0.22GB 0.00GB 100.0 0|1 nan gunicorn
==========================================================================================
- Output Report
The report consists of three parts: memory access matrix information, node details of memory access traffic, and process information sorted by memory access.
- Memory access matrix information
Each cell of the matrix consists of three parts: the traffic from SRC to DST, the NUMA distance from SRC to DST, and the proportion of the SRC-to-DST traffic to the total traffic.
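The score formula printed in the command output can be reproduced from the matrix cells. The following is a minimal sketch, assuming the cell layout shown in the sample output (traffic in GB, NUMA distance, access percentage); the function names and the regular expression are illustrative and are not part of the tool.

```python
import re

# One matrix cell, for example "0.63GB|10|38.24%".
CELL = re.compile(r"([\d.]+)GB\|(\d+)\|([\d.]+)%")


def parse_matrix(rows):
    """Return (traffic_gb, distance, access_fraction) for every cell in the rows."""
    cells = []
    for row in rows:
        for traffic, distance, percent in CELL.findall(row):
            cells.append((float(traffic), int(distance), float(percent) / 100.0))
    return cells


def numa_score(cells):
    """score = (max cost - real cost) / (max cost - min cost)."""
    distances = [distance for _, distance, _ in cells]
    max_cost, min_cost = max(distances), min(distances)
    if max_cost == min_cost:
        # Assumption: with a single distance value every access is equally
        # costly, so the best score is returned instead of dividing by zero.
        return 1.0
    real_cost = sum(distance * fraction for _, distance, fraction in cells)
    return (max_cost - real_cost) / (max_cost - min_cost)


# Matrix rows copied from the sample command output.
SAMPLE_ROWS = [
    "SRC_0 0.63GB|10|38.24% 1.03GB|12|23.53% 0.10GB|20|8.82% 0.12GB|22|2.94%",
    "SRC_1 0.00GB|12|0.00% 1.16GB|10|26.47% 0.00GB|22|0.00% 0.00GB|24|0.00%",
    "SRC_2 0.00GB|20|0.00% 0.00GB|22|0.00% 0.00GB|10|0.00% 0.00GB|12|0.00%",
    "SRC_3 0.00GB|22|0.00% 0.00GB|24|0.00% 0.00GB|12|0.00% 0.00GB|10|0.00%",
]

if __name__ == "__main__":
    print(round(numa_score(parse_matrix(SAMPLE_ROWS)), 2))  # 0.88, as in the sample
```

For the sample matrix the real cost is about 11.71, with a maximum distance of 24 and a minimum of 10, which gives the reported system score of 0.88.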
- Node details of memory access traffic
Table 2 Parameters of node details

| Parameter | Description |
|---|---|
| NID | NUMA node ID. |
| RMA(Die) | Access traffic across NUMA dies. |
| RMA(Skt) | Access traffic across NUMA sockets. |
| LMA | Local access traffic on the NUMA node. |
| %RMA | Percentage of remote access traffic. |
| MEM(all) | Total memory size. |
| MEM(free) | Available memory size. |
| %MEM | Memory usage. |
| %CPU | Number of occupied CPU cores. For example, 600% indicates that 6 CPU cores are occupied. |
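The derived columns in the node table can be cross-checked against the raw columns. The formulas in this sketch are assumptions inferred from the sample output, not documented definitions: %RMA is taken as remote traffic (die plus socket) over total traffic, and %MEM as used memory over total memory. The function names are hypothetical.

```python
def rma_percent(rma_die_gb, rma_skt_gb, lma_gb):
    """Assumed formula: remote traffic (die + socket) as a share of all traffic."""
    total = rma_die_gb + rma_skt_gb + lma_gb
    if total == 0:
        return 0.0  # a node without traffic is shown as 0.0 in the sample output
    return 100.0 * (rma_die_gb + rma_skt_gb) / total


def mem_used_percent(mem_all_gb, mem_free_gb):
    """Assumed formula: memory in use as a share of the total memory."""
    return 100.0 * (mem_all_gb - mem_free_gb) / mem_all_gb


def occupied_cores(cpu_percent):
    """%CPU counts cores in units of 100%, so 600% means 6 cores."""
    return cpu_percent / 100.0


if __name__ == "__main__":
    # Node 0 from the sample output: 1.03GB, 0.22GB, 0.63GB, 63.21GB, 0.27GB.
    print(round(rma_percent(1.03, 0.22, 0.63), 1))
    print(round(mem_used_percent(63.21, 0.27), 1))
    print(occupied_cores(600))
```

Because the report rounds traffic to two decimal places, values recomputed from the printed columns can differ from the printed percentage in the last digit.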
- Top N processes sorted by memory access
Table 3 Parameters of process information

| Parameter | Description |
|---|---|
| PID | Process ID. |
| SCORE | NUMA score. |
| ACCESS | Percentage of the process access traffic to the total traffic (determines the top N sorting). |
| RMA(Die) | Access traffic across NUMA dies. |
| RMA(Skt) | Access traffic across NUMA sockets. |
| LMA | Local access traffic on the NUMA node. |
| %RMA | Percentage of remote access traffic. |
| MIGRATED | Number of times that threads are migrated between NUMA nodes and number of threads in a process, in the format X\|Y. |
| %CPU | CPU usage. |
| COMMAND | Command line of a process. |
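A process row from the report can be split into the fields described above for further scripting. This is a minimal sketch that assumes the whitespace-separated column order of the sample output (PID, SCORE, ACCESS, RMA(Die), RMA(Skt), LMA, %RMA, MIGRATED, %CPU, COMMAND); the function names and dictionary keys are hypothetical.

```python
def parse_process_row(line):
    """Split one process row; COMMAND is the last column and may contain spaces."""
    fields = line.split(None, 9)
    migrations, threads = fields[7].split("|")  # MIGRATED is printed as X|Y
    return {
        "pid": int(fields[0]),
        "score": float(fields[1]),
        "access_percent": float(fields[2].rstrip("%")),
        "rma_percent": float(fields[6]),
        "migrations": int(migrations),
        "threads": int(threads),
        "command": fields[9],
    }


def top_n(rows, n):
    """Order parsed rows by ACCESS, highest first, as the report does."""
    return sorted(rows, key=lambda row: row["access_percent"], reverse=True)[:n]


# Rows copied from the sample command output.
SAMPLE = [
    "3784179 0.61 29.41% 0.77GB 0.22GB 0.00GB 100.0 0|1 nan gunicorn",
    "2083500 1.00 38.24% 0.00GB 0.00GB 0.63GB 0.0 0|2 nan test_thread",
]

if __name__ == "__main__":
    for row in top_n([parse_process_row(line) for line in SAMPLE], 2):
        print(row["pid"], row["access_percent"], row["command"])
```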