Tuning Performance Bottlenecks
Procedure
- Modify the source file, as shown in Figure 1. After modification, rename the file to memory_bound_after.c and upload it to the /home/demo directory.
The modified source code uses the MemoryBoundBench_OPT function. This function applies block-based optimization: it restricts the range of the variable j so that the data is divided into small blocks and processed one block at a time, which exploits the spatial and temporal locality of the cache. Each data block is then reused while it is still in the cache, which reduces cache misses.
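The blocking idea can be illustrated with a minimal sketch. This is not the MemoryBoundBench_OPT code from Figure 1; the function names, the update formula, and the parameters n, passes, and block are assumptions chosen only for illustration.

```c
#include <stddef.h>

/* Hypothetical baseline: each pass streams through the whole array, so for a
 * large n the data is evicted from cache before the next pass reuses it. */
void update_naive(double *a, const double *b, size_t n, int passes)
{
    for (int p = 0; p < passes; p++)
        for (size_t j = 0; j < n; j++)
            a[j] = a[j] * 0.5 + b[j];
}

/* Hypothetical blocked variant: j is restricted to one block at a time and
 * all passes run on that block before moving on, so the block can stay in
 * cache while it is reused. Assumes block > 0. */
void update_blocked(double *a, const double *b, size_t n, int passes,
                    size_t block)
{
    for (size_t start = 0; start < n; start += block) {
        /* Clamp the last block so it does not run past the end of the array. */
        size_t end = (n - start < block) ? n : start + block;
        for (int p = 0; p < passes; p++)
            for (size_t j = start; j < end; j++)
                a[j] = a[j] * 0.5 + b[j];
    }
}
```

Both functions apply the same sequence of updates to each element, so they produce identical results; only the order in which memory is touched differs. A suitable block size depends on the cache sizes of the target CPU and is best chosen by measurement.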
- Compile the source file.
gcc -O2 -o /home/demo/ddrc_after /home/demo/memory_bound_after.c -fopenmp
- Switch to the installation directory of the Kunpeng Performance Boundary Analyzer. Replace x.x.x in the command with the actual version.
cd /home/ksys-x.x.x-Linux-aarch64
- Collect the application performance data after optimization.
./ksys collect /home/demo/ddrc_after
Figure 2 Memory access statistics
According to the memory access statistics, the total DDRC read bandwidth decreases from over 28,700 MB/s to over 790 MB/s, roughly a 36-fold reduction. The lower DDRC bandwidth indicates that more accesses are served from the cache and the CPU spends less time waiting for data from memory, which resolves the memory bottleneck and improves overall data throughput.
- Switch to the demo directory and check the runtime of the application after the optimization.
- Switch to the demo directory.
cd /home/demo
- Check the application runtime after the optimization.
./ddrc_after
After the command is executed, the application runtime decreases from 6,372 ms to 3,503 ms, a reduction of about 45%. The optimization therefore improves the computing performance of the application.
Figure 3 Runtime
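The before-and-after figures quoted in this procedure can be compared with two simple ratios. The following sketch is only an illustration; the helper names improvement_factor and percent_reduction are hypothetical and are not part of the analyzer or the demo program.

```c
/* Hypothetical helper: how many times larger the "before" value is. */
double improvement_factor(double before, double after)
{
    return before / after;
}

/* Hypothetical helper: relative reduction, in percent of the "before" value. */
double percent_reduction(double before, double after)
{
    return (before - after) / before * 100.0;
}

/* With the runtime figures from this procedure:
 *   improvement_factor(6372.0, 3503.0) is about 1.82
 *   percent_reduction(6372.0, 3503.0)  is about 45.0
 * With the DDRC read bandwidth figures, which the text gives only as
 * "over 28,700 MB/s" and "over 790 MB/s", so the result is approximate:
 *   improvement_factor(28700.0, 790.0) is about 36.3
 */
```

The runtime improves by a smaller factor than the DDRC bandwidth drops. A plausible reading is that memory traffic is only one part of the total execution time, but that interpretation is not stated in the measurements themselves.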
