Overview
The code samples listed in Table 1 demonstrate the functions of each Kunpeng DevKit tool. Refer to these samples when you analyze and optimize your own development projects with the Kunpeng DevKit.
| Tool | Working Mode | Scenario | Description | Sample Code |
|---|---|---|---|---|
| System Profiler | CLI | Sample 1: Using Roofline Analysis to Tune Applications | For applications of the same type, you can use the roofline analysis function of the Kunpeng DevKit System Profiler to tune the application level by level across multiple dimensions, and in doing so learn how to perform a roofline analysis task. | matrix.h, matrix.c, matmult.h, main.c, intrinsic_matmult.c, block_matmult.c, base_matmult.c |
| System Profiler | WebUI | Sample 1: Matrix Analysis | The Kunpeng DevKit System Profiler is used to tune a program that performs one-dimensional matrix calculation in a for loop. Hotspot function analysis identifies multiply as the hotspot function of the matrix calculation. The program is then tuned with NEON instructions, and the results before and after tuning are compared. | multiply.c, multiply_simd.c, multiply_start.sh |
| System Profiler | WebUI | Sample 2: Detecting and Tuning Column-wise Access Loops | Using a program that traverses a two-dimensional array in a loop, the hotspot function analysis function of the Kunpeng DevKit System Profiler compares the miss events of row-wise access and column-wise access. The analysis result indicates that row-wise access improves the CPU cache hit ratio. | cache_hit.c, cache_miss.c, miss_start.sh, hit_start.sh |
| System Profiler | WebUI | Sample 3: Frequent Lock Preemption | Lock preemption and contention occur frequently in multi-threaded programs and waste CPU resources. Contention for shared resources can generally be addressed by analyzing and simplifying the service logic. In this sample, the resource scheduling analysis and lock & wait analysis functions of the Kunpeng DevKit System Profiler are used to analyze the service logic. You can reduce the lock granularity and the number of concurrent threads to reduce lock contention. | pthread_mutex.c, pthread_atomic.c |
| System Profiler | WebUI | Sample 4: MPI Application Analysis | The HPC application analysis function of the Kunpeng DevKit System Profiler helps you understand the communication status of the application in each rank. | ring.c |
| System Profiler | WebUI | Sample 5: Long Application Execution Caused by MPI Blocking Communication Functions | In an MPI/OpenMP hybrid scenario, you can use the HPC application analysis function of the Kunpeng DevKit System Profiler to understand how to tune application performance in each scenario. | send_recv.cpp |
| System Profiler | WebUI | Sample 6: NUMA Refined Analysis | In a non-uniform memory access (NUMA) architecture, the Kunpeng DevKit System Profiler can perform NUMA refined analysis. It collects NUMA performance data for all processes in the system and identifies the top N (for example, top 10) processes with the poorest NUMA performance. It also generates a statistics matrix of memory access between NUMA nodes and identifies unbalanced memory access between nodes, and provides tuning suggestions based on these results. | None |