Tuning Performance Bottlenecks

Procedure

  1. Modify the source file as shown in Figure 1. Save the modified file as memory_bound_after.c and upload it to the /home/demo directory.
    The modified source code uses the MemoryBoundBench_OPT function. This function applies block-based optimization (loop blocking): it restricts the range of the variable j so that the data is divided into small blocks and processed one block at a time. Each block stays in the cache while it is being processed, so the code exploits the spatial and temporal locality of the cache and incurs fewer cache misses.
    Figure 1 Modified source file
  2. Compile the source file.
    gcc -O2 -o /home/demo/ddrc_after /home/demo/memory_bound_after.c -fopenmp
  3. Switch to the installation directory of the Kunpeng Performance Boundary Analyzer. Replace x.x.x in the command with the actual version.
    cd /home/ksys-x.x.x-Linux-aarch64
  4. Collect the application performance data after optimization.
    ./ksys collect /home/demo/ddrc_after
    Figure 2 Memory access statistics

    According to the memory access statistics, the total DDRC read bandwidth drops from over 28,700 MB/s to over 790 MB/s, a reduction of roughly 97%. The lower DDRC bandwidth indicates that the CPU spends less time waiting for data from memory. The memory bottleneck is relieved and overall data throughput efficiency improves.

  5. Switch to the demo directory and check the runtime of the application after the optimization.
    1. Switch to the demo directory.
      cd /home/demo
    2. Check the application runtime after the optimization.
      ./ddrc_after

      The command output shows that the application runtime decreases from 6,372 ms to 3,503 ms, a reduction of about 45%. The optimization therefore improves the computation performance of the application.

      Figure 3 Runtime
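
The block-based optimization described in step 1 can be illustrated with a minimal sketch. This is not the MemoryBoundBench_OPT function from Figure 1, whose source is not reproduced here. It is a generic loop-blocking example, a matrix transpose, and the names transpose_naive, transpose_blocked, and BLOCK are hypothetical. It assumes a square, row-major matrix of doubles.

```c
#include <stddef.h>

/* Hypothetical tile edge length. A real value should be tuned so that
 * one source tile and one destination tile fit in the cache together. */
#define BLOCK 32

/* Baseline: dst is written with a stride of n elements, so consecutive
 * stores land on different cache lines once n is large. */
void transpose_naive(const double *src, double *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            dst[j * n + i] = src[i * n + j];
}

/* Blocked version: i and j are each restricted to a BLOCK-sized range,
 * so the loop nest works on one small tile at a time and the cache
 * lines it touches are reused before they are evicted. */
void transpose_blocked(const double *src, double *dst, size_t n)
{
    for (size_t ii = 0; ii < n; ii += BLOCK) {
        size_t i_end = (ii + BLOCK < n) ? ii + BLOCK : n;
        for (size_t jj = 0; jj < n; jj += BLOCK) {
            size_t j_end = (jj + BLOCK < n) ? jj + BLOCK : n;
            for (size_t i = ii; i < i_end; i++)
                for (size_t j = jj; j < j_end; j++)
                    dst[j * n + i] = src[i * n + j];
        }
    }
}
```

Both functions produce the same result. Only the order in which memory is visited differs, and that order is what determines how much traffic reaches the DDR controller. The i_end and j_end bounds handle matrix sizes that are not a multiple of BLOCK.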