Overview

The System Profiler is a performance analysis tool for Kunpeng-powered servers. It collects performance data from the processor hardware, the OS, processes, threads, and functions; analyzes system performance metrics; locates system bottlenecks and hotspot functions; and provides tuning suggestions. The tool helps you quickly locate and resolve software performance problems.

Figure 1 System Profiler
Table 1 Task description

| Task Type | Task Subtype | Description |
| --- | --- | --- |
| General analysis | Hotspot function analysis | The tool analyzes C/C++ program code, identifies performance bottlenecks, and displays hotspot functions. It also shows function call relationships in flame graphs and provides a tuning path. |
| System component analysis | NUMA refined analysis | This analysis is based on the Arm Statistical Profiling Extension (SPE). SPE samples instructions and records information about triggered events, including accurate program counter (PC) values. The tool uses SPE to collect the NUMA performance of all processes in the system, find the top N (for example, N = 10) processes with the poorest NUMA performance and the hotspot memory areas of these processes, produce the matrix of memory access statistics between NUMA nodes, and identify imbalanced inter-node memory access. It then provides related tuning suggestions. |
| System component analysis | I/O analysis | The tool analyzes storage I/O performance. By analyzing block storage devices, it obtains performance data such as the number of I/O operations, I/O data size, I/O queue depth, and I/O operation latency, and associates them with specific I/O operations, processes, threads, call stacks, and I/O APIs in the application layer. Based on the I/O performance data, the tool provides tuning suggestions. |
| Specified analysis | Lock and wait analysis | The tool analyzes the lock and wait functions (including sleep, usleep, mutex, cond, spinlock, rwlock, and semaphore) of glibc and open source software such as MySQL and OpenMP, associates these functions with the processes and call sites they belong to, and provides tuning suggestions based on existing experience. |
| Specified analysis | HPC application analysis | The tool collects Performance Monitor Unit (PMU) events of the system and the key metrics of MPI and MPI+OpenMP applications. This helps you accurately obtain the serial and parallel time of the parallel region and barrier-to-barrier, calibrated 2-layer microarchitecture metrics, instruction distribution, L3 usage, and memory bandwidth. |
| Comparative analysis | - | For analysis tasks of the same type, you can select the same node or different nodes and compare the analysis results. This lets you quickly see how the results differ, locate changes in performance metrics, and evaluate the effect of optimization methods. |
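
To make the NUMA refined analysis row more concrete, the following is a minimal sketch of how an inter-node memory access matrix could be summarized and how processes could be ranked by NUMA behavior. It is not the tool's implementation or output format. The matrix layout (`matrix[i][j]` is the number of sampled accesses issued from node `i` to memory on node `j`), the function names, the process names, and all counts are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Matrix = List[List[int]]  # hypothetical layout: matrix[src_node][dst_node] = sampled accesses


def remote_access_ratio(matrix: Matrix) -> float:
    """Fraction of sampled accesses that target memory on a different NUMA node."""
    total = sum(sum(row) for row in matrix)
    if total == 0:
        return 0.0
    local = sum(matrix[i][i] for i in range(len(matrix)))
    return (total - local) / total


def top_n_worst(per_process: Dict[str, Matrix], n: int = 10) -> List[Tuple[str, float]]:
    """Rank processes by remote access ratio, highest (poorest locality) first."""
    ranked = sorted(
        ((name, remote_access_ratio(m)) for name, m in per_process.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    return ranked[:n]


def system_matrix(per_process: Dict[str, Matrix]) -> Matrix:
    """Element-wise sum of all per-process matrices (system-wide view)."""
    matrices = list(per_process.values())
    size = len(matrices[0])
    return [
        [sum(m[i][j] for m in matrices) for j in range(size)]
        for i in range(size)
    ]


# Made-up sample counts for a two-node system (illustrative only).
samples: Dict[str, Matrix] = {
    "mysqld": [[900, 100], [300, 700]],
    "redis-server": [[500, 500], [0, 0]],
    "nginx": [[1000, 0], [0, 1000]],
}

worst = top_n_worst(samples, n=2)
overall = system_matrix(samples)
print(worst)
print(overall, remote_access_ratio(overall))
```

In this sketch a high remote access ratio is used as a stand-in for "poor NUMA performance"; a real analysis would likely also weigh access latency and the hotspot memory areas mentioned in the table.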