Introduction to the Kunpeng DevKit Plugin

The Kunpeng DevKit, an end-to-end tool kit designed for developers, offers a series of plug-and-play tools for software development, porting, compilation and debugging, and performance tuning on the Kunpeng platform based on Visual Studio Code (VS Code). The Kunpeng DevKit consists of the Porting Advisor, Affinity Analyzer, Development Assistant, Compiler and Debugger, System Profiler, Java Profiler, and System Diagnosis.

It provides an IDE frontend GUI and one-click backend installation, facilitates coding, automatically detects and installs Kunpeng compilers, performs compilation and debugging, visualizes cases, provides coding assistance, and analyzes and scans projects. You can use the Kunpeng DevKit plugin to deploy the whole Kunpeng DevKit directly, or to install only the subtools you need.

Table 1 lists the tools provided by the Kunpeng DevKit:

Table 1 Tools provided by the Kunpeng DevKit

| Tool | Description | Software Package |
| --- | --- | --- |
| Porting Advisor | Ports software from x86 servers running Linux to Kunpeng servers running Linux, with necessary software scan and analysis capabilities. | DevKit-Porting-Advisor-x.x.x-Linux-platform.tar.gz |
| Affinity Analyzer | Checks software code on the Kunpeng 920 platform to improve code quality and memory access performance. | DevKit-Affinity-Analyzer-x.x.x-Linux-platform.tar.gz |
| Development Assistant | Leverages Kunpeng computing capabilities and high-performance components to help you develop Kunpeng applications with ease. This tool can be used only in the integrated development environment (IDE). | DevKit-Devtools-x.x.x-Linux-platform.tar.gz |
| Compiler and Debugger | Supports remote compilation and debugging on the Kunpeng platform and improves the compilation and debugging efficiency through a visualized UI. This tool can be used only in the integrated development environment (IDE). | DevKit-Debugger-x.x.x-Linux-platform.tar.gz |
| System Profiler | Collects and analyzes performance data in multiple scenarios, and provides tuning suggestions based on the tuning system. | DevKit-Sys-Perf-x.x.x-Linux-platform.tar.gz |
| Java Profiler | Analyzes and optimizes the performance of Java programs running on Kunpeng servers. | DevKit-Java-Perf-x.x.x-Linux-platform.tar.gz |
| System Diagnosis | Quickly locates and diagnoses component exceptions, and identifies memory usage problems in the source code. | DevKit-Sys-Diagnosis-x.x.x-Linux-platform.tar.gz |

  • In a software package name, x.x.x indicates the software version and platform indicates the platform type: x86-64 (for x86-based servers) or Kunpeng (for servers powered by the Kunpeng 920 processor).
  • A software package contains the Open_Source_Software_Notice.txt file.
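
The naming convention above can be checked mechanically. The following Python sketch splits a package file name into its tool, version, and platform parts. The regular expression, the function name parse_package_name, the assumption that x.x.x is three dot-separated numbers, and the sample version 1.0.0 are illustrative choices and are not part of the Kunpeng DevKit.

```python
import re

# Pattern derived from the naming convention described above:
# DevKit-<tool>-<x.x.x>-Linux-<platform>.tar.gz
# Assumption: the version consists of three dot-separated numbers.
PACKAGE_RE = re.compile(
    r"^DevKit-(?P<tool>.+)-(?P<version>\d+\.\d+\.\d+)"
    r"-Linux-(?P<platform>x86-64|Kunpeng)\.tar\.gz$"
)


def parse_package_name(name):
    """Return (tool, version, platform), or None if the name does not match."""
    match = PACKAGE_RE.match(name)
    if match is None:
        return None
    return match.group("tool"), match.group("version"), match.group("platform")


# The version number 1.0.0 is a made-up placeholder for x.x.x.
print(parse_package_name("DevKit-Porting-Advisor-1.0.0-Linux-Kunpeng.tar.gz"))
print(parse_package_name("DevKit-Sys-Perf-1.0.0-Linux-x86-64.tar.gz"))
```

A name that does not follow the convention returns None, which makes the helper usable as a quick sanity check before a package is uploaded or installed.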

Porting Advisor

The Porting Advisor simplifies the application porting process and supports scanning, analysis, and porting of software from x86 Linux to Kunpeng Linux. This tool automatically analyzes applications and generates guidance reports, greatly improving code porting efficiency.

The following table lists functions supported by the Porting Advisor.

Table 2 Functions

Function

Description

Software porting assessment

  • Checks the shared object (SO) dependency libraries and executable files contained in software packages (RPM, DEB, TAR, ZIP, GZIP, and others) and assesses the portability of these files.
  • Checks the SO dependency libraries and binary files in the Java software packages (JAR, WAR, and EAR) and assesses the portability of these files.
  • Checks the SO dependency libraries and executable files in the specified software installation path and assesses the portability of these files.

Source code porting

  • Checks the C/C++/ASM/Fortran/Go software build project files and provides porting suggestions.
  • Checks the link libraries used by C/C++/Fortran/Go/interpreted language software build project files and provides porting suggestions.
  • Checks the C/C++/ASM/Fortran/Go/interpreted language software source code and provides porting suggestions. The tool supports porting of Fortran source code from the Intel Fortran compiler to the GCC Fortran compiler, and checks of compiler features and syntax extensions.
  • Checks the compatibility of the SO files loaded by the Python/Java/Scala program through the ctypes module.
  • Analyzes some x86 assembly instructions and converts them into equivalent Kunpeng assembly instructions.

Software package rebuild

Analyzes the composition of the software package to be ported on the Kunpeng platform, rebuilds and generates a software package compatible with the Kunpeng platform, or directly provides a software package that can be used.

Dedicated software porting

Modifies the source code of some common solutions on the Kunpeng platform, compiles the code, and generates the software packages compatible with the Kunpeng platform.
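
Whether a shared object or executable can run on a Kunpeng server depends first on the instruction set it was built for, which is recorded in the e_machine field of the ELF header. The Python sketch below reads that field. It illustrates the general idea behind a portability check and is not the Porting Advisor's actual implementation; the function names and the synthetic headers used for the demonstration are assumptions.

```python
import struct

# e_machine values defined by the ELF specification
EM_X86_64 = 62
EM_AARCH64 = 183
MACHINE_NAMES = {EM_X86_64: "x86-64", EM_AARCH64: "AArch64"}


def elf_machine(header):
    """Return the target architecture named in the first 20 bytes of an ELF file."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # Byte 5 (EI_DATA) is 1 for little-endian and 2 for big-endian files.
    byte_order = "<" if header[5] == 1 else ">"
    # e_machine is a 16-bit field at offset 18, right after e_type.
    (e_machine,) = struct.unpack_from(byte_order + "H", header, 18)
    return MACHINE_NAMES.get(e_machine, "other (%d)" % e_machine)


def synthetic_header(e_machine):
    """Build a minimal little-endian 64-bit ELF header prefix for demonstration only."""
    e_ident = b"\x7fELF" + bytes([2, 1, 1]) + bytes(9)  # class, data, version, padding
    return e_ident + struct.pack("<HH", 3, e_machine)   # e_type = 3 (shared object)


print(elf_machine(synthetic_header(EM_X86_64)))
print(elf_machine(synthetic_header(EM_AARCH64)))
```

For a real file, the same function can be applied to the first 20 bytes read from the file in binary mode. A library reported as x86-64 would need an AArch64 build or a rebuild from source before it can be used on the Kunpeng platform.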

Affinity Analyzer

The Affinity Analyzer checks software code to improve code quality and memory access performance. It works only on the Kunpeng 920 processor platform. The following table lists functions supported by the Affinity Analyzer.

Table 3 Functions

Function

Description

64-bit running mode check

Identifies the applications to be ported from the 32-bit platform to the 64-bit platform and provides modification suggestions for the porting.

Byte alignment check

Checks the byte alignment of the structure variables in the source code.

Cache line alignment check

Checks the 128-byte alignment of structure variables in the C/C++ source code to improve memory access performance.

Static memory consistency check

Checks for any memory consistency problem when the source code is ported to the Kunpeng platform and provides suggestions on inserting memory barriers.

Vectorization check

Checks vectorizable code fragments and provides modification suggestions.

Matricization check

Checks matricizable code fragments and provides modification suggestions.

Build affinity

Analyzes the content in the makefile and CMakeLists.txt that can be replaced with content in the Kunpeng library, and provides replacement suggestions and function repair.

Calculation precision analysis

After instrumenting application functions using the precision analysis tool, run the functions on the x86 platform and Kunpeng platform. Then the precision analysis tool compares the output results to analyze the calculation precision differences between the platforms.
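
To illustrate what the byte alignment and cache line alignment checks in Table 3 look for, the following Python sketch uses the standard ctypes module to lay out two C-compatible structures that hold the same members in a different order. The structure names, member names, and helper function are hypothetical. The sizes depend on the platform ABI; on mainstream ABIs, where a 32-bit integer is 4-byte aligned, the first layout is larger than the second because of padding.

```python
import ctypes

CACHE_LINE_BYTES = 128  # alignment size named in the cache line alignment check


class Loose(ctypes.Structure):
    # 1-byte, 4-byte, 1-byte members: the compiler layout pads around the small members.
    _fields_ = [
        ("flag", ctypes.c_uint8),
        ("count", ctypes.c_uint32),
        ("tag", ctypes.c_uint8),
    ]


class Reordered(ctypes.Structure):
    # Same members, largest first: the two 1-byte members sit next to each other.
    _fields_ = [
        ("count", ctypes.c_uint32),
        ("flag", ctypes.c_uint8),
        ("tag", ctypes.c_uint8),
    ]


def padding_to_cache_line(size, line=CACHE_LINE_BYTES):
    """Bytes of padding needed to round size up to a multiple of the line size."""
    return -size % line


print("Loose:", ctypes.sizeof(Loose), "bytes, count at offset", Loose.count.offset)
print("Reordered:", ctypes.sizeof(Reordered), "bytes")
print("Padding to fill one cache line:", padding_to_cache_line(ctypes.sizeof(Reordered)))
```

Reordering members removes wasted padding, and rounding a frequently accessed structure up to a whole cache line keeps two such structures from sharing a line.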

Development Assistant

You can use the Development Assistant to create Kunpeng application projects in C and C++. During coding, it automatically looks up the functions provided by the Kunpeng Library, and highlights and suggests the matching replacement libraries and functions.

The following table lists functions supported by the Development Assistant.

Table 4 Functions

Function

Description

General-purpose computing application

Provides Kunpeng general-purpose computing SDKs. You can create general-purpose computing application projects to facilitate development of basic applications, including acceleration library, hardware acceleration, and homogeneous acceleration framework (HAF) applications.

Secure computing application

Enables you to create GlobalPlatform-compliant secure computing projects and high-level language projects with ease, reconstruct existing Java or Python projects, deploy SDKs, and check the compilation environment.

HPC application

Enables you to create high-performance computing (HPC) applications based on the Hyper MPI and math libraries, and to extend sample projects to improve development efficiency.

DPAK application

Enables you to create a Data Processing & Acceleration Kit (DPAK) project with ease and provides the Kunpeng DPAK SDK. The Kunpeng DPAK provides service offload capabilities for SmartNIC scenarios, including network offload and virtualization offload.

Data I/O application

Enables you to create data I/O applications that are built on the Kunpeng Storage Acceleration Library (KSAL) and use Kunpeng storage acceleration algorithms to improve I/O read performance.

Dictionary management

You can import local dictionary files.

Coding assistance

Provides hints and function search for functions in the Kunpeng Library; suggests and highlights the tuned functions in the Kunpeng Library during coding.

Compiler and Debugger

The Compiler and Debugger allows you to deploy Kunpeng compilers in one click and to debug GPU-based applications in a single-node system, using CUDA-GDB on a unified interface. It also supports parallel debugging of applications on multiple nodes in HPC scenarios and remote compilation and debugging on the Kunpeng platform.

Table 5 Functions

Function

Description

Compiler deployment

Supports one-click deployment of the GCC for openEuler, BiSheng JDK, and BiSheng Compiler.

Common compilation

Provides basic remote compilation capabilities and supports visualized compilation parameter configuration, one-click compilation, and real-time display of compilation information.

Automatic FDO compilation

Automatic feedback-directed optimization (FDO) is a technique that simplifies the deployment of profile-guided optimization (PGO). It samples the program's runtime profile to obtain the program execution status indirectly.

General application debugging

Provides remote debugging capabilities on the Kunpeng platform and a graphical user interface (GUI), improving debugging efficiency.

HPC parallel application debugging

Supports concurrent debugging on multiple nodes in HPC scenarios. MPI applications can only be debugged in launch mode. Parallel computing involves task parallelism and data parallelism. That is, different tasks are executed or different data is stored on different nodes. Currently, HPC parallel tasks support only CPU debugging.

CUDA application debugging

Allows you to debug Compute Unified Device Architecture (CUDA) programs on the Kunpeng platform. It leverages CUDA-GDB to debug GPU applications on a unified interface.

Security application debugging

After you create a Java or Python project in the Development Assistant, a general compilation task and a security application debugging task are generated in the Compiler and Debugger. You can compile and run the tasks based on your requirements or create a security application debugging task on your own.

DPU debugging

The Compiler and Debugger supports DPU debugging based on the Kunpeng platform to implement the DPU X Debugger (XDB) debugging capability. The XDB is a DPU debugging tool. It allows you to debug microcode programs on CPUs by using the GNU debugger (GDB), view registers, local variables, single-port RAM (spram), thread variables, and call stacks, and set breakpoints and monitoring.

System Profiler

The System Profiler is a performance analysis tool for Kunpeng-powered servers. It collects performance data of processor hardware, operating system (OS), processes/threads, and functions, analyzes system performance metrics, locates system bottlenecks and hotspot functions, and provides tuning suggestions. This tool helps quickly locate and handle software performance problems. It is unavailable in x86 environments.

The Tuning Assistant is a tool for tuning Kunpeng-powered servers. It systematically organizes performance metrics and provides guidance for analyzing performance bottlenecks, thus realizing quick tuning.

Table 6 Functions

Task Type

Description

Tuning analysis

The Tuning Assistant systematically organizes and analyzes performance metrics, hotspot functions, and system configurations to form a system resource consumption chain. It provides guidance for analyzing performance bottlenecks based on tuning paths and gives tuning suggestions and operation guides for each tuning path to implement fast tuning.

Comparison analysis

For the same type of analysis tasks, you can select the same node or different nodes to compare the analysis results. In this way, you can quickly learn the differences between different analysis results, locate performance metric changes, and identify the effect of optimization methods.

HPC cluster check

The tool checks the hardware and software configurations of a specified MPI cluster and provides a report on software and hardware configuration consistency between nodes in the cluster. The configuration items include CPUs, GPUs, interconnection, memory, NICs, drives, OS, kernel, environment variables, MPI/OpenMP, and common HPC dependency libraries. The tool gives tuning suggestions on configurations that do not comply with the best practices of the Kunpeng platform.

HPC application analysis

The tool collects Performance Monitor Unit (PMU) events of the system and the key metrics of OpenMP and MPI applications to help you accurately obtain the serial and parallel time of the parallel region and barrier-to-barrier, calibrated L2 micro-architecture metrics, instruction distribution, L3 usage, and memory bandwidth.

Overall analysis

The tool collects the software and hardware configuration information of the entire system and the running status of system resources, such as CPUs, memory, storage I/O, and network I/O, to obtain performance metrics such as usage, saturation, and errors. These metrics help identify performance bottlenecks. The tool also provides performance tuning suggestions for some metrics based on the benchmark data and experience.

The tool checks the hardware configuration, system configuration, and component configuration in big data, database, and distributed storage scenarios, displays the configuration items that are not optimal, and analyzes and provides typical hardware configuration and software version information.

Microarchitecture analysis

The tool obtains the running status of instructions on the CPU pipeline based on Arm PMU events. It helps quickly locate the performance bottleneck of the current application on the CPU and modify the programs to maximize the utilization of hardware resources.

Memory access analysis

By analyzing the events related to the CPU's access to the cache and memory, the tool identifies potential performance bottlenecks on memory access, locates the possible causes, and provides tuning suggestions.
  1. Memory access statistics analysis

    Based on the PMU events related to the processor's access to the cache and memory, the tool analyzes the number of access operations, hit rate, and bandwidth, including:

    • Access hit rate and bandwidth of the L1C, L2C, L3C, and TLB.
    • HHA access rate
    • DDR access bandwidth and access operations
  2. Miss event analysis

    This analysis is based on the Arm Statistical Profiling Extension (SPE) capability. SPE samples instructions and records information about triggered events, including accurate PC pointer information. This capability can be used to analyze miss events, such as LLC misses, TLB misses, remote access, and long latency loads, and accurately associate the code that causes the events. Based on the information, you can modify your programs to reduce the probability of certain events and improve performance.

  3. NUMA refined analysis

    This analysis is based on the Arm Statistical Profiling Extension (SPE) capability. SPE samples instructions and records information about triggered events, including accurate PC pointer information. The tool leverages the SPE capability to collect the NUMA performance of all processes in the system, find the top N (for example, N = 10) processes with the poorest NUMA performance and the hotspot memory areas of these processes, and identify the inter-NUMA node memory access statistics matrix and the inter-node memory access imbalance status. Then related tuning suggestions are provided.

I/O analysis

The tool analyzes the storage I/O performance. By analyzing block storage devices, the tool obtains performance data such as the number of I/O operations, I/O data size, I/O queue depth, and I/O operation latency, and identifies specific I/O operations, processes, threads, call stacks, and I/O APIs in the application layer. Based on the I/O performance data, the tool provides tuning suggestions.

Process/Thread performance analysis

The tool collects information about the CPU, memory, and storage I/O resources used by processes or threads to obtain metrics such as usage, saturation, and number of errors, and to identify performance bottlenecks of processes/threads. The tool then provides tuning suggestions for some metrics based on the benchmark data and experience. It also supports analysis of the system call information for a single process.

Resource scheduling analysis

The tool analyzes system resource scheduling based on CPU scheduling events, including:

  1. Running status of CPU cores at each time point, such as Idle and Running, and the duration proportion of each status.
  2. Running status of processes or threads at each time point, which can be Wait, Schedule, or Running, and the duration proportion of each status.
  3. Process/thread switching information, including the number of switchovers, average scheduling delay, minimum scheduling delay, and maximum scheduling delay.
  4. Number of times each process or thread switches between NUMA nodes. If the number of switchovers is greater than the reference value, core binding suggestions are provided.

Hotspot function analysis

The tool analyzes C/C++ program code, identifies performance bottlenecks, and displays hotspot functions. It also displays function call relationships in flame graphs and provides the tuning path.

Lock and wait analysis

The tool analyzes the lock and wait functions (including sleep, usleep, mutex, cond, spinlock, rwlock, and semaphore) of glibc and open-source software, such as MySQL and OpenMP, associates the processes and call sites to which the lock and wait functions belong, and provides tuning suggestions based on existing experience.

Roofline analysis

Helps pinpoint application bottlenecks on a given hardware platform and optimize an application accordingly.

AI tuning analysis

Leverages a Huawei-developed high-performance AI tuning solution to tune applications in database and big data scenarios according to the test case you specify. After the analysis, the tool provides tuning suggestions on parameter configurations in complex scenarios.
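
The roofline analysis listed in Table 6 rests on a simple model: attainable performance is bounded by the lower of peak compute throughput and memory bandwidth multiplied by arithmetic intensity (floating-point operations per byte moved). The Python sketch below applies that standard formula. The function names are illustrative, and the peak and bandwidth figures are hypothetical placeholders, not measurements of any Kunpeng processor.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, intensity_flop_per_byte):
    """Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak_gflops, bandwidth_gb_s * intensity_flop_per_byte)


def limiting_factor(peak_gflops, bandwidth_gb_s, intensity_flop_per_byte):
    """Classify a kernel by comparing its intensity with the ridge point."""
    ridge_point = peak_gflops / bandwidth_gb_s  # intensity where the two roofs meet
    if intensity_flop_per_byte < ridge_point:
        return "memory-bound"
    return "compute-bound"


# Hypothetical platform figures, chosen only to keep the arithmetic easy to follow.
PEAK_GFLOPS = 1000.0
BANDWIDTH_GB_S = 100.0

for intensity in (2.0, 20.0):
    print(
        intensity,
        attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GB_S, intensity),
        limiting_factor(PEAK_GFLOPS, BANDWIDTH_GB_S, intensity),
    )
```

A kernel below the ridge point benefits most from reducing memory traffic, while a kernel above it benefits from better use of the compute units, for example through vectorization.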

Java Profiler

The Java Profiler is a performance analysis and tuning tool for Java programs running on Kunpeng-powered servers. It displays information about the heaps, threads, locks, and garbage collection (GC) of Java programs in graphics, collects information about hotspot functions, and helps locate program bottlenecks.

Table 7 Functions

Task Type

Description

Real-time profiling

The tool analyzes the target Java virtual machine (JVM) and Java programs. Specifically, the tool analyzes the heap, GC activities, thread status, and performance of upper-layer Java programs, and provides information such as call chain analysis results, hotspot functions, lock analysis results, program thread status, and object distribution. The JVM running data is obtained in real time using the agent for precise analysis.

The analysis results include the following content:

  1. Overview
    • Real-time display of the JVM system status.
    • Real-time display of the JVM information, including the heap size, GC activities, number of threads, number of loaded classes, and CPU usage.
  2. Thread information

    Displays the real-time active thread status and current thread dump in the JVM, displays the thread lock status in graphics, and analyzes thread deadlocks.

  3. Memory information
    • By capturing heap snapshots, the tool analyzes the heap histogram distribution and dominator tree of an application at a certain point of time and traces the reference relationship chain from each Java object in the heap memory to the GC root, helping locate potential memory problems. The tool compares and analyzes heap snapshots at different points of time, and analyzes the changes of heap usage and allocation, helping to detect exceptions.
    • Obtains the quantity and sizes of objects created in the Java heap and displays the memory usage in real time.
  4. Hotspot information

    Hotspot methods analyzed by the tool are displayed in icicle graphs. Hotspot methods at different layers (such as the Java call layer, JNI layer, Native layer, and kernel layer) are distinguished by different colors. You can look into details about the bytecode (optional) of the Java method, machine instructions generated by the JVM JIT compiler, and hotspot distribution of these instructions. Reasons are provided if the bytecode cannot be viewed. The tool also collects call chains of specified entry methods as well as data such as method calling relationships and time consumption during the sampling period. These are displayed in a tree chart.

  5. GC information

    You can collect and analyze GC events in the target JVM process in real time and analyze factors such as GC causes, phases, performance, and pauses to locate potential GC-related memory issues and performance bottlenecks.

  6. I/O information

    Analyzes the file I/O, socket I/O latency, and consumed bandwidth of an application in real time to identify hotspot I/O operations.

  7. Database information
    • Monitors and analyzes database connection pools. It monitors connections in a database connection pool, helping locate potential connection leaks and providing tuning suggestions for improper connection pool configurations.
    • Analyzes hotspot SQL operations of Java database connectivity (JDBC). It records the SQL call time, duration, and stack tracing in applications, helping locate the hotspot SQL operation that takes the longest time.
    • Analyzes hotspot NoSQL operations. It records the NoSQL call time, duration, and stack tracing in applications, helping locate the hotspot NoSQL operation that takes the longest time.
  8. HTTP information

    Records the time and duration of HTTP requests in applications and identifies hotspot HTTP requests.

  9. Snapshot information

    Snapshots can be generated during real-time analysis of heap, I/O, and workload data. By comparing snapshots, the tool helps detect trends in resource and service metrics and identify potential risks of resource leaks or performance deterioration.

Sampling profiling

The tool collects internal activities and performance events of the JVM through data sampling, and performs offline analysis through data recording and playback. This method requires only a small overhead and has little impact on services, making it suitable for large Java programs.

The analysis results include the following content:

  1. Overview
    • JVM system status.
    • Sampling and playback of recorded JVM information, including the heap usage, GC activities, I/O consumption, and CPU usage.
  2. Thread dump and lock analysis results
    • Analyzes the thread status and locks of programs. It obtains the thread status changes and current thread dump within the sampling time, displays the thread lock status in graphics based on the thread dump, and analyzes the thread deadlock.
    • Analyzes and estimates the objects that block threads and the blocking time.
  3. Calling method–based sampling analysis results
    • Analyzes the CPU cycle proportion and location of hotspot functions in Java and native code.
    • Displays hotspot functions and their call stacks in a flame graph.
  4. Memory analysis results
    • Displays allocation of Java objects in the heap, helping to detect potential problems by quickly locating the objects that consume the most heap resources or that are allocated the most heap resources. The tool uses stack trace to locate potential memory problems.
    • By sampling Java objects with a long retention period, you can identify potential heap memory leaks in Java applications and locate possible causes.
  5. GC analysis results

    Displays the Java GC configuration, heap size changes, and GC event occurrence. You can analyze and adjust the current GC policy by observing the heap changes, GC activity frequency, and pause time.

  6. I/O analysis results

    Helps analyze the file read/write statistics and socket traffic usage of the target Java application to detect I/O performance bottlenecks. You can also analyze the application's read/write statistics on files, including the read/write path, read/write frequency, read/write rate, total read/write amount, stack tracing (configurable), and time-based change graph.

System Diagnosis

Table 8 Task description

Task Type

Description

Memory usage diagnosis

The tool analyzes memory usage problems of the application (including unreleased memory and abnormal memory releases) and associates them with the call stack and source code.

Memory overwriting diagnosis

The tool analyzes memory overwriting problems of the application, provides the memory overwriting type and memory access information, and associates the call stack and source code.

Network I/O diagnosis

The tool performs pressure tests on the network to obtain the maximum network capability and provide basic reference data for network I/O performance tuning. It diagnoses the network, locates network problems, and helps resolve network I/O performance problems caused by network configurations and exceptions. The functions include network dialing tests, packet loss diagnosis (not supported for RDMA), network packet capture (not supported for RDMA), and system load monitoring. Network I/O diagnosis can also collect statistics on network data flows, analyze UDP and TCP data flows, RDMA RoCE v2 data flows, and InfiniBand (IB) data flows in the IPv4 and IPv6 protocol stacks, and collect statistics on the execution of data flows on different processor cores in different phases.

Storage I/O diagnosis

The tool performs pressure tests on storage I/O to obtain the maximum capabilities of the storage device, including throughput, IOPS, and latency, and to provide basic reference data for storage I/O performance tuning.
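
The three storage figures named above are related to one another, which helps when reading a pressure test report. As a rough sketch: throughput equals IOPS multiplied by the I/O size, and by Little's law the average number of outstanding I/Os equals IOPS multiplied by the average latency. The function names and the sample numbers below are hypothetical and do not come from the tool.

```python
def throughput_mib_s(iops, block_size_kib):
    """Throughput in MiB/s for a given IOPS figure and I/O size in KiB."""
    return iops * block_size_kib / 1024


def outstanding_ios(iops, latency_ms):
    """Little's law: average I/Os in flight = arrival rate x time in system."""
    return iops * latency_ms / 1000


# Hypothetical result: 20,000 IOPS at 4 KiB per I/O with 0.5 ms average latency.
print(throughput_mib_s(20000, 4))
print(outstanding_ios(20000, 0.5))
```

If the measured queue depth is far below the value Little's law suggests for the device's rated IOPS and latency, the workload may not be issuing enough concurrent I/O to reach the device's maximum capability.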