Introduction

The System Profiler is a performance analysis tool for Kunpeng-powered servers. It collects performance data from the processor hardware, operating system (OS), processes, threads, and functions, analyzes system performance metrics, locates system bottlenecks and hotspot functions, and provides tuning suggestions. The tool helps you quickly locate and resolve software performance problems.

Table 1 Task description

Task Type

Description

Comparison analysis

For the same type of analysis tasks, you can select the same node or different nodes to compare the analysis results. In this way, you can quickly learn the differences between different analysis results, locate performance metric changes, and identify the effect of optimization methods.

HPC cluster check

The tool checks the hardware and software configurations of a specified MPI cluster and provides a report on software and hardware configuration consistency between nodes in the cluster. The configuration items include CPUs, GPUs, interconnection, memory, NICs, drives, OS, kernel, environment variables, MPI/OpenMP, and common HPC dependency libraries. The tool gives tuning suggestions on configurations that do not comply with the best practices of the Kunpeng platform.

HPC application analysis

The tool collects system Performance Monitoring Unit (PMU) events and the key metrics of OpenMP and MPI applications. From these, it accurately obtains the serial and parallel time of parallel regions and barrier-to-barrier intervals, calibrated L2 microarchitecture metrics, instruction distribution, L3 usage, and memory bandwidth.

Overall analysis

The tool collects the software and hardware configuration information of the entire system and the running status of system resources, such as CPUs, memory, storage I/O, and network I/O, to obtain performance metrics such as usage, saturation, and errors. These metrics help identify performance bottlenecks. The tool also provides performance tuning suggestions for some metrics based on the benchmark data and experience.

The tool also checks the hardware, system, and component configurations in big data, database, and distributed storage scenarios, displays configuration items that are not optimal, and provides typical hardware configuration and software version information.

Microarchitecture analysis

The tool uses Arm Performance Monitoring Unit (PMU) events to obtain the running status of instructions in the CPU pipeline. This helps you quickly locate the CPU performance bottleneck of the current application and modify the program to maximize the utilization of hardware resources.

Memory access analysis

By analyzing the events related to the CPU's access to the cache and memory, the tool identifies potential performance bottlenecks on memory access, locates the possible causes, and provides tuning suggestions.
  • Memory access statistics analysis

    Based on the PMU events related to the processor's access to the cache and memory, the tool analyzes the number of access operations and hit rate, including:

    • Access hit rate and bandwidth of the L1C, L2C, L3C, and TLB
    • HHA access rate
    • DDR access bandwidth and access operations
  • Miss event analysis

    This analysis is based on the Arm Statistical Profiling Extension (SPE) capability. SPE samples instructions and records information about triggered events, including accurate PC pointer information. This capability can be used to analyze miss events, such as LLC misses, TLB misses, remote access, and long latency loads, and accurately associate the code that causes the events. Based on the information, you can modify your programs to reduce the probability of certain events and improve performance.

  • NUMA refined analysis

    This analysis is based on the Arm SPE capability. SPE samples instructions and records information about triggered events, including accurate PC pointer information. The tool uses SPE to collect the NUMA performance of all processes in the system, find the top N (for example, N = 10) processes with the poorest NUMA performance and the hotspot memory areas of these processes, and identify the inter-NUMA node memory access statistics matrix and the inter-node memory access imbalance status. The tool then provides related tuning suggestions.

I/O analysis

The tool analyzes storage I/O performance. By analyzing block storage devices, the tool obtains performance data such as the number of I/O operations, I/O data size, I/O queue depth, and I/O operation delay, and identifies specific I/O operations, processes, threads, call stacks, and I/O APIs in the application layer. Based on the I/O performance data, the tool provides tuning suggestions.

Resource scheduling analysis

The tool collects the running status of processes and threads to obtain metrics such as the cold flame graph, number of link switchovers, and global proportions, and identifies performance bottlenecks based on the metrics. The tool can also analyze the system call status of a single process.

Hotspot function analysis

The tool analyzes C/C++ program code, identifies performance bottlenecks, and displays hotspot functions. It also displays the function call relationship in flame graphs and provides the tuning path.

Lock and wait analysis

The tool analyzes the lock and wait functions (including sleep, usleep, mutex, cond, spinlock, rwlock, and semaphore) of glibc and open source software, such as MySQL and OpenMP, associates the processes and call sites to which the lock and wait functions belong, and provides tuning suggestions based on existing experience.

Roofline analysis

The tool helps pinpoint application bottlenecks on a given hardware platform so that you can optimize the application accordingly.
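
As background, the roofline model bounds attainable performance by the lower of peak compute throughput and memory bandwidth multiplied by arithmetic intensity. The following is a minimal sketch of that bound only. The function name roofline and the sample figures are hypothetical and are not output of the tool.

```shell
# Sketch of the roofline bound: min(peak, bandwidth * arithmetic intensity).
# Usage: roofline PEAK_GFLOPS BANDWIDTH_GBS INTENSITY_FLOP_PER_BYTE
roofline() {
    awk -v peak="$1" -v bw="$2" -v ai="$3" 'BEGIN {
        bound = bw * ai                 # memory-bound ceiling
        if (bound > peak) bound = peak  # compute-bound ceiling
        printf "%.1f\n", bound
    }'
}

# Hypothetical platform: 100 GFLOP/s peak, 50 GB/s memory bandwidth.
roofline 100 50 0.5   # memory bound, prints 25.0
roofline 100 50 4     # compute bound, prints 100.0
```

An application whose measured performance sits well below the bound for its arithmetic intensity is a candidate for tuning.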

Use Restrictions

Table 2 Use restrictions

Task Type

Description

Comparison analysis

Overall analysis, hotspot function analysis, and roofline analysis are supported.

HPC cluster check

Password-free login must be enabled for each node in the MPI cluster. For configuration items that do not comply with the best practices of the Kunpeng platform, the tool provides optimization suggestions.
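
One way to confirm password-free login before starting a cluster check is to attempt a non-interactive SSH connection to each node. This is a sketch that assumes SSH key-based login is the mechanism in use. The function name can_login and the SSH variable are hypothetical helpers, and the node names are passed as script arguments.

```shell
# Sketch: verify that each node accepts a non-interactive SSH login.
# SSH can be overridden (for example, to add options); it defaults to ssh.
SSH=${SSH:-ssh}

# Return 0 if host $1 can be reached without a password prompt.
# BatchMode=yes makes ssh fail instead of prompting for a password.
can_login() {
    $SSH -o BatchMode=yes -o ConnectTimeout=5 "$1" true >/dev/null 2>&1
}

# Check every node name given on the command line.
for node in "$@"; do
    if can_login "$node"; then
        echo "$node: password-free login OK"
    else
        echo "$node: password-free login FAILED"
    fi
done
```

For example, saving the sketch as a script and running it with the cluster's node names as arguments prints one status line per node.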

HPC application analysis

During OpenMP data collection, the tool temporarily changes the kernel parameters /proc/sys/kernel/kptr_restrict and /proc/sys/kernel/perf_event_paranoid so that call graph data and PMU events can be collected. After the collection is complete, the two kernel parameters are restored to their original values.
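
The save-and-restore behavior described above can be pictured with a small shell sketch. It is illustrative only and does not show how the tool is implemented. The function name with_param, the temporary values, and the collection command are hypothetical, and writing to /proc/sys requires root permission.

```shell
# Sketch: run a command with a kernel parameter temporarily changed,
# then put the original value back, even if the command fails.
# Usage: with_param PARAM_FILE TEMP_VALUE COMMAND [ARGS...]
# The body runs in a subshell so that nested calls do not share variables.
with_param() (
    file=$1; temp_value=$2; shift 2
    original=$(cat "$file") || exit 1            # remember current value
    printf '%s\n' "$temp_value" > "$file" || exit 1
    "$@"                                         # run the wrapped command
    status=$?
    printf '%s\n' "$original" > "$file"          # restore original value
    exit "$status"
)

# Example (requires root; the values 0 and -1 and ./collect.sh are assumptions):
# with_param /proc/sys/kernel/kptr_restrict 0 \
#     with_param /proc/sys/kernel/perf_event_paranoid -1 ./collect.sh
```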

Microarchitecture analysis

You must have root permission to perform the following operations.

  1. If perf_event_paranoid is not set correctly, set it to -1. For example, in CentOS and openEuler, run the following command:
    echo -1 > /proc/sys/kernel/perf_event_paranoid

  2. If a message is displayed indicating that data collection fails and the OS performance monitor is not enabled, run the following command to enable it by turning off the NMI watchdog:
    echo 0 > /proc/sys/kernel/nmi_watchdog
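
Before changing either parameter, it can be useful to see which of them actually differs from the value given above. The following read-only sketch prints the command to run only when a change is needed. The function name suggest_fix is a hypothetical helper.

```shell
# Sketch: print the command needed to bring parameter file $1 to value $2.
# Prints nothing if the file already holds that value or cannot be read.
suggest_fix() {
    file=$1; wanted=$2
    [ -r "$file" ] || return 0
    current=$(cat "$file")
    if [ "$current" != "$wanted" ]; then
        printf 'echo %s > %s\n' "$wanted" "$file"
    fi
}

# Values taken from the two steps above.
suggest_fix /proc/sys/kernel/perf_event_paranoid -1
suggest_fix /proc/sys/kernel/nmi_watchdog 0
```

The sketch does not modify the system. Run any printed command as root to apply the change.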
    

Memory access analysis

This function is available only on openEuler and CentOS 7.6 systems that support the Statistical Profiling Extension (SPE) feature. The supported openEuler kernel versions are 4.19 and later. The supported CentOS 7.6 kernel versions are 4.14.0-115.el7a.0.1, 4.14.0-115.2.2.el7a, 4.14.0-115.5.1.el7a, 4.14.0-115.6.1.el7a, 4.14.0-115.7.1.el7a, 4.14.0-115.8.2.el7a, and 4.14.0-115.10.1.el7a.

  • Miss event analysis

    Miss event analysis is not supported in VM and container environments.

  • NUMA refined analysis

    NUMA refined analysis is not supported in VM environments.
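
A quick way to see whether SPE is exposed to the OS is to look for an SPE PMU device in sysfs. This sketch assumes the kernel's SPE driver registers a device named arm_spe_0 (or similar) under /sys/bus/event_source/devices, as mainline Linux does. That path is an assumption and is not taken from the tool's documentation. The function name has_spe is hypothetical.

```shell
# Sketch: return 0 if an SPE PMU device is visible under the sysfs
# directory $1 (default: /sys/bus/event_source/devices).
has_spe() {
    dir=${1:-/sys/bus/event_source/devices}
    for dev in "$dir"/arm_spe_*; do
        [ -e "$dev" ] && return 0
    done
    return 1
}

if has_spe; then
    echo "SPE PMU device found"
else
    echo "SPE PMU device not found"
fi
```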

I/O analysis

The system kernel must support ftrace collection.
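
Whether ftrace is available can usually be judged from the tracing file system. The sketch below assumes the standard mount points /sys/kernel/tracing and /sys/kernel/debug/tracing and looks for the available_tracers file. Reading these directories may require root permission, and the function name ftrace_dir is hypothetical.

```shell
# Sketch: print the first directory among the arguments that contains
# the ftrace control file available_tracers; return 1 if none does.
ftrace_dir() {
    for dir in "$@"; do
        if [ -e "$dir/available_tracers" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

if dir=$(ftrace_dir /sys/kernel/tracing /sys/kernel/debug/tracing); then
    echo "ftrace interface: $dir"
else
    echo "ftrace interface not visible (unsupported, not mounted, or no permission)"
fi
```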

Resource scheduling analysis

You are advised to use an OS whose kernel version is 4.19 or later to run resource scheduling analysis tasks.
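
The kernel version can be compared against 4.19 with a short script. This sketch assumes GNU sort, whose -V option orders version strings. The function name version_ge is hypothetical.

```shell
# Sketch: return 0 if version $1 is greater than or equal to version $2.
# Relies on GNU sort -V for version ordering.
version_ge() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Strip the distribution suffix, for example "4.19.90-2003" becomes "4.19.90".
kernel=$(uname -r | cut -d- -f1)
if version_ge "$kernel" 4.19; then
    echo "kernel $kernel: 4.19 or later"
else
    echo "kernel $kernel: earlier than 4.19"
fi
```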

Roofline analysis

Set /proc/sys/kernel/perf_event_paranoid to 0 or a smaller value.
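
The condition above can be checked without changing the system. In this sketch the function name paranoid_ok is a hypothetical helper. It accepts an optional file path so that it can be tried against a copy of the value.

```shell
# Sketch: return 0 if the value in file $1 is 0 or smaller.
# Defaults to /proc/sys/kernel/perf_event_paranoid; returns 2 if unreadable.
paranoid_ok() {
    file=${1:-/proc/sys/kernel/perf_event_paranoid}
    value=$(cat "$file" 2>/dev/null) || return 2
    [ "$value" -le 0 ]
}

if paranoid_ok; then
    echo "perf_event_paranoid is 0 or smaller"
else
    echo "perf_event_paranoid must be set to 0 or a smaller value (or could not be read)"
fi
```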

Lock and wait analysis

The environment must support the extended Berkeley Packet Filter (eBPF) configuration.
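
One way to look for eBPF support is to inspect the kernel build configuration. This sketch assumes the configuration file is installed as /boot/config-<kernel release>, which is common but not guaranteed. The options checked (CONFIG_BPF and CONFIG_BPF_SYSCALL) are standard kernel options chosen here as an assumption; the exact set the tool requires is not stated in this document. The function name config_enabled is hypothetical.

```shell
# Sketch: return 0 if option $2 is set to y or m in kernel config file $1.
config_enabled() {
    grep -q "^$2=[ym]\$" "$1" 2>/dev/null
}

cfg=/boot/config-$(uname -r)
for opt in CONFIG_BPF CONFIG_BPF_SYSCALL; do
    if config_enabled "$cfg" "$opt"; then
        echo "$opt: enabled"
    else
        echo "$opt: not confirmed in $cfg"
    fi
done
```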