Introduction to the Kunpeng DevKit

The Kunpeng DevKit provides a series of tools covering application porting, testing, diagnosis, and tuning, enabling you to quickly build high-performance Kunpeng-compatible software. It facilitates application porting to the more efficient Kunpeng computing platform, which streamlines the development process. The Kunpeng DevKit offers:

  • Development environments in multiple programming languages, such as C, C++, Java, and Python. You can use the Kunpeng DevKit in the WebUI or IDE.
  • Optimized libraries, porting, and performance test tools, helping you exploit the advantages of the Kunpeng architecture and build high-performance applications on the Kunpeng platform.
  • Reference documents and code samples to help with Kunpeng application development.
  • Online technical support and a communication platform for troubleshooting and exchanges with other developers.

The Kunpeng DevKit includes the following tools:

Table 1 Tools provided by the Kunpeng DevKit

Tool

Description

System Migration

Offers wizard-driven, automatic migration across OSs, databases, middleware, and applications.

Porting Advisor

Ports software from x86 servers running Linux to Kunpeng servers running Linux, with necessary software scan and analysis capabilities.

Affinity Analyzer

Checks software code on the Kunpeng 920 platform to improve code quality and memory access performance.

System Profiler

Collects and analyzes performance data in multiple scenarios, and provides tuning suggestions based on the tuning system.

Java Profiler

Analyzes and optimizes the performance of Java programs running on Kunpeng servers.

System Diagnosis

Quickly locates and diagnoses component exceptions, and identifies memory usage problems in the source code.

System Migration

The System Migration tool supports only the Kunpeng platform. In a full package installation scenario, System Migration is integrated in the Porting Advisor tool.

Function

Description

System migration

Offers wizard-driven, automatic migration across OSs, databases, middleware, and applications.

Porting Advisor

The Porting Advisor simplifies application porting. It scans, analyzes, and ports software from x86 Linux to Kunpeng Linux, and it automatically generates porting guidance reports, which greatly improves code porting efficiency.

The following table lists functions supported by the Porting Advisor.

Table 2 Functions

Function

Description

Software porting assessment

  • Checks the shared object (SO) dependency libraries and executable files contained in software packages (RPM, DEB, TAR, ZIP, GZIP, and others) and assesses the portability of these files.
  • Checks the SO dependency libraries and binary files in the Java software packages (JAR, WAR, and EAR) and assesses the portability of these files.
  • Checks the SO dependency libraries and executable files in the specified software installation path and assesses the portability of these files.
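A first portability question for any SO or executable file is which processor architecture it was built for. The sketch below reads the `e_machine` field of an ELF header to answer that. It is only an illustration of the idea and not how the Porting Advisor is implemented. The helper names (`elf_machine`, `runs_on_kunpeng`, `fake_header`) are hypothetical; the `e_machine` values and field offsets come from the ELF specification.

```python
import struct

# e_machine values defined by the ELF specification
EM_X86_64 = 62
EM_AARCH64 = 183


def elf_machine(header: bytes) -> int:
    """Return the e_machine field from the first 20 bytes of an ELF file."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # EI_DATA (byte 5 of e_ident): 1 = little-endian, 2 = big-endian
    byte_order = "<" if header[5] == 1 else ">"
    # e_machine is a 16-bit field at offset 18, right after e_type
    (machine,) = struct.unpack_from(byte_order + "H", header, 18)
    return machine


def runs_on_kunpeng(header: bytes) -> bool:
    """True if the binary targets AArch64, the architecture of Kunpeng processors."""
    return elf_machine(header) == EM_AARCH64


def fake_header(machine: int) -> bytes:
    """Build a synthetic little-endian 64-bit ELF header prefix for the demo."""
    e_ident = b"\x7fELF" + bytes([2, 1, 1]) + bytes(9)  # 16 bytes in total
    return e_ident + struct.pack("<HH", 3, machine)     # e_type = ET_DYN, e_machine


print(runs_on_kunpeng(fake_header(EM_X86_64)))   # an x86-64 library
print(runs_on_kunpeng(fake_header(EM_AARCH64)))  # an AArch64 library
```

For a real file, pass the first 20 bytes, for example `open(path, "rb").read(20)`.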

Source code porting

  • Checks the C/C++/ASM/Fortran/Go software build project files and provides porting suggestions.
  • Checks the link libraries used by C/C++/Fortran/Go/interpreted language software build project files and provides porting suggestions.
  • Checks the C/C++/ASM/Fortran/Go/interpreted language software source code and provides porting suggestions. The tool supports porting of Fortran source code from the Intel Fortran compiler to the GCC Fortran compiler, and checks of compiler features and syntax extensions.
  • Checks the compatibility of the SO files loaded by the Python/Java/Scala program through the ctypes module.
  • Analyzes some x86 assembly instructions and converts them into equivalent Kunpeng assembly instructions.

Software package rebuild

Analyzes the composition of the software package to be ported on the Kunpeng platform, rebuilds and generates a software package compatible with the Kunpeng platform, or directly provides a software package that can be used.

Dedicated software porting

Modifies the source code of some common solutions on the Kunpeng platform, compiles the code, and generates the software packages compatible with the Kunpeng platform.

Affinity Analyzer

The Affinity Analyzer checks software code to improve code quality and memory access performance. It works only on the Kunpeng 920 processor platform. The following table lists functions supported by the Affinity Analyzer.

Table 3 Functions

Function

Description

64-bit running mode check

Identifies the applications to be ported from the 32-bit platform to the 64-bit platform and provides modification suggestions for the porting.

Byte alignment check

Checks the byte alignment of the structure variables in the source code.
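To see why structure byte alignment matters, the following sketch uses Python's `ctypes` module, which lays out structures with the platform's native C alignment rules. The structure and field names are hypothetical, and this is not the analyzer's own logic. It only shows how field order changes padding.

```python
import ctypes


class Padded(ctypes.Structure):
    # A 1-byte field before a 4-byte field forces 3 bytes of padding,
    # and the trailing 1-byte field forces 3 more at the end.
    _fields_ = [
        ("flag", ctypes.c_char),
        ("count", ctypes.c_int32),
        ("kind", ctypes.c_char),
    ]


class Reordered(ctypes.Structure):
    # Same fields, largest first: the two 1-byte fields share one 4-byte slot.
    _fields_ = [
        ("count", ctypes.c_int32),
        ("flag", ctypes.c_char),
        ("kind", ctypes.c_char),
    ]


def layout(struct_type):
    """Map each field name to its byte offset inside the structure."""
    return {name: getattr(struct_type, name).offset for name, _ in struct_type._fields_}


print(ctypes.sizeof(Padded), layout(Padded))
print(ctypes.sizeof(Reordered), layout(Reordered))
```

Reordering the fields shrinks the structure from 12 bytes to 8 bytes without changing what it stores.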

Cache line alignment check

Checks the 128-byte alignment of structure variables in the C/C++ source code to improve memory access performance.
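The arithmetic behind a 128-byte alignment check is small enough to sketch. The function names below are hypothetical and the code is an illustration of the rule, not the analyzer's implementation.

```python
CACHE_LINE = 128  # bytes, the alignment that the check above refers to


def is_cache_line_aligned(address: int, line: int = CACHE_LINE) -> bool:
    """True if the address starts exactly on a cache line boundary."""
    return address % line == 0


def padded_size(size: int, line: int = CACHE_LINE) -> int:
    """Round a structure size up to the next multiple of the cache line."""
    return (size + line - 1) // line * line


print(padded_size(40))             # a 40-byte structure padded to one full line
print(is_cache_line_aligned(192))  # 192 is a multiple of 64 but not of 128
```

Padding a frequently written structure to a full line keeps two such structures from sharing one cache line.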

Static memory consistency check

Checks for any memory consistency problem when the source code is ported to the Kunpeng platform and provides suggestions on inserting memory barriers.

Vectorization check

Checks vectorizable code fragments and provides modification suggestions.

Matricization check

Checks matricizable code fragments and provides modification suggestions.

Build affinity

Analyzes the content in Makefile and CMakeLists.txt files that can be replaced with content from the Kunpeng library, and provides replacement suggestions and a repair function.

Calculation precision analysis

After you instrument application functions with the precision analysis tool and run them on both the x86 platform and the Kunpeng platform, the tool compares the output results to identify calculation precision differences between the platforms.
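Comparing outputs from two platforms usually comes down to a relative-difference measure. The following is a minimal sketch under that assumption. The function names and the tolerance are hypothetical, and the precision analysis tool may use a different metric.

```python
def max_relative_diff(reference, candidate):
    """Largest relative difference between two equally long result sequences."""
    if len(reference) != len(candidate):
        raise ValueError("result sequences must have the same length")
    worst = 0.0
    for a, b in zip(reference, candidate):
        scale = max(abs(a), abs(b))
        if scale > 0.0:  # two exact zeros differ by nothing
            worst = max(worst, abs(a - b) / scale)
    return worst


def outputs_match(reference, candidate, tolerance=1e-12):
    """True if every pair of values agrees within the relative tolerance."""
    return max_relative_diff(reference, candidate) <= tolerance


# Hypothetical outputs of the same function recorded on two platforms
x86_results = [1.0, 2.5, 1000.0]
kunpeng_results = [1.0, 2.5, 1000.0000001]
print(max_relative_diff(x86_results, kunpeng_results))
print(outputs_match(x86_results, kunpeng_results))
```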

System Profiler

The System Profiler is a performance analysis tool for Kunpeng-powered servers. It collects performance data of processor hardware, operating system (OS), processes/threads, and functions, analyzes system performance metrics, locates system bottlenecks and hotspot functions, and provides tuning suggestions. This tool helps quickly locate and handle software performance problems. It is unavailable in x86 environments.

The Tuning Assistant is a tool for tuning Kunpeng-powered servers. It systematically organizes performance metrics and provides guidance for analyzing performance bottlenecks, which speeds up tuning.

Table 4 Functions

Task Type

Description

Tuning analysis

The Tuning Assistant systematically organizes and analyzes performance metrics, hotspot functions, and system configurations to form a system resource consumption chain. It provides guidance for analyzing performance bottlenecks along tuning paths, and gives tuning suggestions and operation guides for each tuning path to speed up tuning.

Comparison analysis

For the same type of analysis tasks, you can select the same node or different nodes to compare the analysis results. In this way, you can quickly learn the differences between different analysis results, locate performance metric changes, and identify the effect of optimization methods.

HPC cluster check

The tool checks the hardware and software configurations of a specified MPI cluster and provides a report on software and hardware configuration consistency between nodes in the cluster. The configuration items include CPUs, GPUs, interconnection, memory, NICs, drives, OS, kernel, environment variables, MPI/OpenMP, and common HPC dependency libraries. The tool gives tuning suggestions on configurations that do not comply with the best practices of the Kunpeng platform.

HPC application analysis

The tool collects Performance Monitoring Unit (PMU) events of the system and the key metrics of OpenMP and MPI applications to help you accurately obtain the serial and parallel time of the parallel region and barrier-to-barrier, calibrated L2 micro-architecture metrics, instruction distribution, L3 usage, and memory bandwidth.

Overall analysis

The tool collects the software and hardware configuration information of the entire system and the running status of system resources, such as CPUs, memory, storage I/O, and network I/O, to obtain performance metrics such as usage, saturation, and errors. These metrics help identify performance bottlenecks. The tool also provides performance tuning suggestions for some metrics based on the benchmark data and experience.

The tool checks the hardware configuration, system configuration, and component configuration in big data, database, and distributed storage scenarios, displays the configuration items that are not optimal, and analyzes and provides typical hardware configuration and software version information.

Microarchitecture analysis

The tool obtains the running status of instructions on the CPU pipeline based on Arm PMU events. It helps quickly locate the performance bottleneck of the current application on the CPU and modify the programs to maximize the utilization of hardware resources.

Memory access analysis

By analyzing the events related to the CPU's access to the cache and memory, the tool identifies potential performance bottlenecks on memory access, locates the possible causes, and provides tuning suggestions.
  1. Memory access statistics analysis

    Based on the PMU events related to the processor's access to the cache and memory, the tool analyzes the number of access operations, hit rate, and bandwidth, including:

    • Access hit rate and bandwidth of the L1C, L2C, L3C, and TLB
    • HHA access rate
    • DDR access bandwidth and access operations
  2. Miss event analysis

    This analysis is based on the Arm Statistical Profiling Extension (SPE) capability. SPE samples instructions and records information about triggered events, including accurate PC pointer information. This capability can be used to analyze miss events, such as LLC misses, TLB misses, remote access, and long latency loads, and accurately associate the code that causes the events. Based on the information, you can modify your programs to reduce the probability of certain events and improve performance.

  3. NUMA refined analysis

    This analysis is also based on the Arm SPE capability. The tool uses SPE to collect NUMA performance data for all processes in the system, find the top N (for example, N = 10) processes with the poorest NUMA performance and the hotspot memory areas of these processes, and identify the inter-NUMA-node memory access statistics matrix and any imbalance in inter-node memory access. It then provides related tuning suggestions.
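The hit rate and bandwidth figures in the memory access statistics analysis above reduce to simple ratios over event counts. The sketch below shows that arithmetic. The function names and the sample numbers are hypothetical, and the counts would come from PMU events in practice.

```python
def hit_rate(accesses: int, misses: int) -> float:
    """Fraction of accesses served by the cache (or TLB) level."""
    if accesses == 0:
        return 0.0
    return (accesses - misses) / accesses


def bandwidth_mb_per_s(bytes_transferred: int, seconds: float) -> float:
    """Average bandwidth over the sampling window, in MB/s (1 MB = 10**6 bytes)."""
    return bytes_transferred / seconds / 1e6


# Hypothetical counter readings for one sampling window
print(hit_rate(accesses=1000, misses=50))
print(bandwidth_mb_per_s(bytes_transferred=2_000_000, seconds=0.5))
```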

I/O analysis

The tool analyzes the storage I/O performance. By analyzing block storage devices, the tool obtains performance data such as the number of I/O operations, I/O data size, I/O queue depth, and I/O operation latency, and identifies specific I/O operations, processes, threads, call stacks, and I/O APIs in the application layer. Based on the I/O performance data, the tool provides tuning suggestions.

Process/Thread performance analysis

The tool collects information about the CPU, memory, and storage I/O resources used by processes or threads to obtain metrics such as usage, saturation, and number of errors, and to identify performance bottlenecks of processes or threads. It then provides tuning suggestions for some metrics based on benchmark data and experience. The tool also supports analysis of the system call information for a single process.

Resource scheduling analysis

The tool analyzes system resource scheduling based on CPU scheduling events, including:

  1. Running status of CPU cores at each time point, such as Idle and Running, and the proportion of time spent in each state.
  2. Running status of processes or threads at each time point, which can be Wait, Schedule, or Running, and the proportion of time spent in each state.
  3. Process/thread switching information, including the number of switchovers and the average, minimum, and maximum scheduling delay.
  4. Number of times that each process or thread switches between NUMA nodes. If the number of switchovers is greater than the reference value, core binding suggestions are provided.
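The state proportions and scheduling delay statistics listed above are straightforward aggregations over scheduling events. The following sketch assumes a simplified, hypothetical input format (state and duration pairs, plus a list of delays) rather than the tool's real event data.

```python
from collections import defaultdict


def state_proportions(intervals):
    """intervals: iterable of (state, duration). Returns state -> share of total time."""
    totals = defaultdict(float)
    for state, duration in intervals:
        totals[state] += duration
    grand_total = sum(totals.values())
    return {state: spent / grand_total for state, spent in totals.items()}


def delay_summary(delays):
    """Switch count plus average, minimum, and maximum scheduling delay."""
    return {
        "switches": len(delays),
        "avg": sum(delays) / len(delays),
        "min": min(delays),
        "max": max(delays),
    }


# Hypothetical timeline of one CPU core (durations in milliseconds)
core_timeline = [("Running", 6.0), ("Idle", 2.0), ("Running", 2.0)]
print(state_proportions(core_timeline))

# Hypothetical scheduling delays of one thread (milliseconds)
print(delay_summary([1.0, 3.0, 2.0]))
```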

Hotspot function analysis

The tool analyzes C/C++ program code, identifies performance bottlenecks, and displays hotspot functions. It also displays function call relationships in flame graphs and provides the tuning path.

Lock and wait analysis

The tool analyzes the lock and wait functions (including sleep, usleep, mutex, cond, spinlock, rwlock, and semaphore) of glibc and open-source software, such as MySQL and OpenMP, associates the processes and call sites to which the lock and wait functions belong, and provides tuning suggestions based on existing experience.

Roofline analysis

Helps pinpoint application bottlenecks on a given hardware platform and optimize an application accordingly.
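The standard roofline model bounds attainable performance by the lower of the compute peak and the product of memory bandwidth and arithmetic intensity. The sketch below applies that formula. The function names and all numbers are hypothetical and are not Kunpeng specifications.

```python
def attainable_gflops(peak_gflops: float, bandwidth_gb_s: float,
                      intensity_flop_per_byte: float) -> float:
    """Roofline bound: min(compute peak, bandwidth * arithmetic intensity)."""
    return min(peak_gflops, bandwidth_gb_s * intensity_flop_per_byte)


def is_memory_bound(peak_gflops: float, bandwidth_gb_s: float,
                    intensity_flop_per_byte: float) -> bool:
    """True if the kernel sits under the sloped (bandwidth) part of the roof."""
    return bandwidth_gb_s * intensity_flop_per_byte < peak_gflops


# Hypothetical machine: 1000 GFLOP/s peak, 100 GB/s memory bandwidth
print(attainable_gflops(1000.0, 100.0, 2.0))   # low-intensity kernel
print(attainable_gflops(1000.0, 100.0, 20.0))  # high-intensity kernel
print(is_memory_bound(1000.0, 100.0, 2.0))
```

A kernel that falls on the bandwidth slope benefits from better data reuse, while one at the flat roof benefits from more efficient computation.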

AI tuning analysis

Leverages a Huawei-developed high-performance AI tuning solution to tune applications in database and big data scenarios according to the test case you specify. After the analysis, the tool provides tuning suggestions on parameter configurations in complex scenarios.

Java Profiler

The Java Profiler is a performance analysis and tuning tool for Java programs running on Kunpeng-powered servers. It graphically displays information about the heaps, threads, locks, and garbage collection (GC) of Java programs, collects information about hotspot functions, and helps locate program bottlenecks.

Table 5 Functions

Task Type

Description

Real-time profiling

The tool analyzes the target Java virtual machine (JVM) and Java programs. Specifically, the tool analyzes the heap, GC activities, thread status, and performance of upper-layer Java programs, and provides information such as call chain analysis results, hotspot functions, lock analysis results, program thread status, and object distribution. The JVM running data is obtained in real time using the agent for precise analysis.

The analysis results include the following content:

  1. Overview
    • Real-time display of the JVM system status.
    • Real-time display of the JVM information, including the heap size, GC activities, number of threads, number of loaded classes, and CPU usage.
  2. Thread information

    Displays the real-time status of active threads and the current thread dump in the JVM, displays the thread lock status graphically, and analyzes thread deadlocks.

  3. Memory information
    • By capturing heap snapshots, the tool analyzes the heap histogram distribution and dominator tree of an application at a certain point of time and traces the reference relationship chain from each Java object in the heap memory to the GC root, helping locate potential memory problems. The tool compares and analyzes heap snapshots at different points of time, and analyzes the changes of heap usage and allocation, helping to detect exceptions.
    • Obtains the quantity and sizes of objects created in the Java heap and displays the memory usage in real time.
  4. Hotspot information

    Hotspot methods analyzed by the tool are displayed in icicle graphs. Hotspot methods at different layers (such as the Java call layer, JNI layer, Native layer, and kernel layer) are distinguished by different colors. You can look into details about the bytecode (optional) of the Java method, machine instructions generated by the JVM JIT compiler, and hotspot distribution of these instructions. Reasons are provided if the bytecode cannot be viewed. The tool also collects call chains of specified entry methods as well as data such as method calling relationships and time consumption during the sampling period. These are displayed in a tree chart.

  5. GC information

    You can collect and analyze GC events in the target JVM process in real time and analyze factors such as GC causes, phases, performance, and pauses to locate potential GC-related memory issues and performance bottlenecks.

  6. I/O information

    Analyzes the file I/O, socket I/O latency, and consumed bandwidth of an application in real time to identify hotspot I/O operations.

  7. Database information
    • Monitors and analyzes database connection pools. It monitors connections in a database connection pool, helping locate potential connection leaks and providing tuning suggestions for improper connection pool configurations.
    • Analyzes hotspot SQL operations of Java database connectivity (JDBC). It records the SQL call time, duration, and stack tracing in applications, helping locate the hotspot SQL operation that takes the longest time.
    • Analyzes hotspot NoSQL operations. It records the NoSQL call time, duration, and stack tracing in applications, helping locate the hotspot NoSQL operation that takes the longest time.
  8. HTTP information

    Records the time and duration of HTTP requests in applications and identifies hotspot HTTP requests.

  9. Snapshot information

    Snapshots can be generated during real-time analysis of heap, I/O, and workload data. By comparing snapshots, the tool helps detect trends in resource and service metrics and identify potential risks of resource leaks or performance deterioration.

Sampling profiling

The tool collects internal activities and performance events of the JVM through data sampling, and performs offline analysis through data recording and playback. This method requires only a small overhead and has little impact on services, making it suitable for large Java programs.

The analysis results include the following content:

  1. Overview
    • JVM system status.
    • Sampling and playback of recorded JVM information, including the heap usage, GC activities, I/O consumption, and CPU usage.
  2. Thread dump and lock analysis results
    • Analyzes the thread status and locks of programs. It obtains the thread status changes and current thread dump within the sampling time, displays the thread lock status graphically based on the thread dump, and analyzes thread deadlocks.
    • Analyzes and estimates the objects that block threads and the blocking time.
  3. Method sampling analysis results
    • Analyzes the CPU cycle proportion and location of hotspot functions in Java and native code.
    • Displays hotspot functions and their call stacks in a flame graph.
  4. Memory analysis results
    • Displays allocation of Java objects in the heap, helping to detect potential problems by quickly locating the objects that consume the most heap resources or that are allocated the most heap resources. The tool uses stack trace to locate potential memory problems.
    • By sampling Java objects with a long retention period, you can identify potential heap memory leaks in Java applications and locate possible causes.
  5. GC analysis results

    Displays the Java GC configuration, heap size changes, and GC event occurrence. You can analyze and adjust the current GC policy by observing the heap changes, GC activity frequency, and pause time.

  6. I/O analysis results

    Helps analyze the file read/write statistics and socket traffic usage of the target Java application to detect I/O performance bottlenecks. You can also analyze the application's read/write statistics on files, including the read/write path, read/write frequency, read/write rate, total read/write amount, stack tracing (configurable), and time-based change graph.

System Diagnosis

Table 6 Task description

Task Type

Description

Memory usage diagnosis

The tool analyzes the memory usage problems (including unreleased memory and abnormal memory releases) of the application to associate the call stack and source code.

Memory overwriting diagnosis

The tool analyzes memory overwriting problems of the application, provides the memory overwriting type and memory access information, and associates the call stack and source code.

Network I/O diagnosis

The tool performs pressure tests on the network to measure the maximum network capability, which provides basic reference data for network I/O performance tuning. It also diagnoses the network, locates network problems, and helps resolve network I/O performance problems caused by network configurations and exceptions. The functions include network dialing test, packet loss diagnosis (not supported for RDMA), network packet capture (not supported for RDMA), and system load monitoring. Network I/O diagnosis can collect statistics on network data flows, analyze UDP, TCP, RDMA RoCE v2, and IB data flows in the IPv4 and IPv6 protocol stacks, and collect statistics on how data flows execute on different processor cores in different phases.

Storage I/O diagnosis

The tool performs pressure tests on storage I/O to measure the maximum capability of the storage device, including throughput, IOPS, and latency. The results provide basic reference data for storage I/O performance tuning.

Related Concepts

Table 7 Concepts

Concept

Description

SO dependency library

Linux shared object files, named in filename.so.version format, for example, libname.so.1.1.1.
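The `filename.so.version` naming pattern can be split into a library name and a numeric version with a short regular expression. This is a sketch with a hypothetical helper name, and it covers only the plain pattern described above.

```python
import re

# name, then ".so", then an optional dotted numeric version such as ".1.1.1"
SO_NAME = re.compile(r"^(?P<name>.+?)\.so(?:\.(?P<version>\d+(?:\.\d+)*))?$")


def parse_so_name(filename: str):
    """Return (library name, version tuple), or None if the name does not match."""
    match = SO_NAME.match(filename)
    if match is None:
        return None
    version_text = match.group("version")
    if version_text is None:
        version = ()
    else:
        version = tuple(int(part) for part in version_text.split("."))
    return match.group("name"), version


print(parse_so_name("libname.so.1.1.1"))
print(parse_so_name("libfoo.so"))
print(parse_so_name("notes.txt"))
```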

Dependency dictionary

A list that the Porting Advisor uses to record the SO files, software supported by the Kunpeng platform, and the installation status (installation from binary packages or installation from source code). You can download and update the dependency dictionary.

Software build project file

The common software build tools of C/C++/ASM/Fortran/Go are Make and CMake, and the corresponding build files are Makefile and CMakeLists.txt.

IPC

Instructions per cycle (IPC) is the average number of instructions that a CPU executes in each clock cycle. It reflects how efficiently the CPU pipeline is used. If a four-issue Kunpeng 920 processor executes four instructions in each clock cycle when its pipeline operates at full load, the IPC is 4.0. The closer the IPC is to 4.0, the more fully the program uses the processor.
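As a worked example of the definition above, the following sketch computes IPC from two counter values and relates it to the 4.0 ceiling. The function names and the counter readings are hypothetical.

```python
def ipc(instructions: int, cycles: int) -> float:
    """Average number of instructions retired per clock cycle."""
    return instructions / cycles


def issue_utilization(ipc_value: float, issue_width: int = 4) -> float:
    """Share of the ideal IPC (4.0 in the example above) that was reached."""
    return ipc_value / issue_width


# Hypothetical counter readings from one profiling run
value = ipc(instructions=8_000, cycles=4_000)
print(value, issue_utilization(value))
```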

CPU Cycles performance event

Using event-based sampling of performance events, the tool analyzes performance metrics related to processors and operating systems. These metrics can be used to locate performance bottlenecks and hotspot code.

CPU Cycles, also called clock tick, is the default performance event. Sampling is based on tick interrupts: each tick interrupt triggers a sample, and the current context of the program is recorded at the sampling point.
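Once the context at each sampling point has been recorded, hotspot code is found by counting how often each function appears. The sketch below shows that aggregation on hypothetical samples. The input format (one call stack per sample, innermost frame last) is an assumption for illustration.

```python
from collections import Counter


def hotspots(samples, top=3):
    """Rank functions by the share of samples in which they were executing.

    samples: list of call stacks, each a list of function names with the
    innermost (currently executing) frame last.
    """
    counts = Counter(stack[-1] for stack in samples if stack)
    total = sum(counts.values())
    return [(func, hits / total) for func, hits in counts.most_common(top)]


# Hypothetical contexts captured at four sampling points
samples = [
    ["main", "parse"],
    ["main", "compute"],
    ["main", "compute"],
    ["main", "compute", "sqrt"],
]
print(hotspots(samples, top=1))
```

Because the result is statistical, more samples give a more reliable ranking.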

USE

The Utilization, Saturation, and Errors (USE) method is used to analyze the utilization, saturation, and errors of all resources to identify performance bottlenecks.

  • Resources: physical components of servers, including CPUs, memory, storage devices, and network devices. The software that provides similar metrics can also be considered as a resource.
  • Utilization: indicates the percentage of time that resources are used for service work within a specified period.
  • Saturation: indicates the degree to which resources cannot accept more work (the kernel usually has a waiting queue).
  • Error: indicates the number of error events.
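The three USE measures can be derived from a few raw observations per resource. The following is a minimal sketch. The `ResourceSample` fields and the 0.8 utilization limit are hypothetical choices for illustration, not values used by the tool.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    busy_seconds: float      # time the resource spent doing work
    interval_seconds: float  # length of the observation window
    queue_length: int        # work waiting because the resource was busy
    error_count: int         # error events seen in the window


def use_report(sample: ResourceSample) -> dict:
    """Compute utilization, saturation, and errors for one resource."""
    return {
        "utilization": sample.busy_seconds / sample.interval_seconds,
        "saturation": sample.queue_length,
        "errors": sample.error_count,
    }


def is_bottleneck_candidate(report: dict, utilization_limit: float = 0.8) -> bool:
    """Flag a resource that is heavily used, has queued work, or reports errors."""
    return (
        report["utilization"] > utilization_limit
        or report["saturation"] > 0
        or report["errors"] > 0
    )


disk = use_report(ResourceSample(4.5, 5.0, 3, 0))
print(disk, is_bottleneck_candidate(disk))
```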

Real-time profiling

Real-time profiling is a dynamic application analysis method that covers both target JVMs and Java programs. It analyzes the internal distribution of resource consumption, method calling frequency, and time consumption while the application is running. This method is often used to help locate application performance bottlenecks and tune performance.

Real-time profiling obtains the calling status of all methods in specific code by instrumenting the classes and methods of programs, which may greatly affect the application performance.
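The idea of instrumentation can be illustrated outside the JVM. The sketch below wraps a Python function so that every call is counted and timed. It is only an analogy: the Java Profiler instruments Java classes and methods, and the names used here (`instrumented`, `call_stats`, `square`) are hypothetical.

```python
import functools
import time
from collections import defaultdict

# function name -> number of calls and total time spent in them
call_stats = defaultdict(lambda: {"calls": 0, "seconds": 0.0})


def instrumented(func):
    """Wrap func so that each call updates call_stats."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            entry = call_stats[func.__name__]
            entry["calls"] += 1
            entry["seconds"] += time.perf_counter() - start
    return wrapper


@instrumented
def square(x):
    return x * x


square(3)
square(4)
print(call_stats["square"]["calls"])
```

Because the wrapper runs on every call, the measurement itself adds overhead, which is why instrumentation can affect application performance.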

Sampling profiling

Sampling profiling collects internal activities and performance events of the JVM through data sampling, and performs offline analysis through data recording and playback. It does not require you to modify the application code and has little impact on performance, which makes it suitable for large-scale Java programs. Its precision is lower than that of real-time profiling because it collects data only periodically.

Upper-layer application workload

Workload analysis dynamically modifies upper-layer application code and embeds hooks to collect specific application performance data, which helps you measure code performance and locate specific code.