
Performance Analysis Process

Prerequisites

  • The server and OS are running properly.
  • An SSH remote login tool has been installed on the local PC.
  • The Kunpeng Performance Boundary Analyzer and System Profiler have been installed in the target environment and are running properly.

Procedure

  1. Download the memory_bound.c and utils.h files from GitHub, upload them to the /home/demo directory, and run the following command to switch to the source code directory:
    cd /home/demo
  2. Compile the source file.
    gcc -O2 -o /home/demo/ddrc_before /home/demo/memory_bound.c -fopenmp
  3. View the application runtime.
    ./ddrc_before

    After the command is executed, the application runtime is 6,372 ms. Use the Kunpeng Performance Boundary Analyzer to check the application for performance issues; if any optimization opportunities are found, update the source code accordingly. This runtime serves as the baseline for measuring the effect of subsequent optimizations.

    Figure 1 Runtime
  4. Use the Kunpeng Performance Boundary Analyzer to locate issues.
    Go to the installation directory of the Kunpeng Performance Boundary Analyzer. Replace x.x.x in the command with the actual version.
    cd /home/ksys-x.x.x-Linux-aarch64
  5. Collect the application performance data.
    ./ksys collect /home/demo/ddrc_before

    /home/demo/ddrc_before indicates the application whose data is to be collected.

    Figure 2 Memory access statistics

    According to the memory access statistics, the ddrc_rd_bw value of node 0 is high. When the Kunpeng server reads data from memory, the requests pass through the DDR controller (DDRC). If the program has a memory bottleneck, the DDRC read/write bandwidth becomes abnormally high, indicating that the program performs a large number of memory reads and writes. The memory access analysis function of the System Profiler collects detailed memory performance data and provides in-depth diagnostics and evaluation, so it is recommended for further investigation.

  6. Use the System Profiler to further analyze the program.
    Switch to the installation directory of the System Profiler. Replace x.x.x in the command with the actual version.
    cd /home/DevKit-Tuner-CLI-x.x.x-Linux-Kunpeng
  7. Create a script file. The memory access statistics analysis function of the System Profiler cannot be launched against a specific application directly, so use a script that starts the profiler and the application together.
    1. Run the following command to create a script:
      vim memory.sh
    2. Press i to enter the insert mode.
    3. Add the following content to the script. Replace the example version x.x.x with the actual version.
      /home/DevKit-Tuner-CLI-x.x.x-Linux-Kunpeng/devkit tuner memory &
      TUNER_PID=$!
      /home/demo/ddrc_before
      pkill -P $TUNER_PID 2>/dev/null
      kill $TUNER_PID 2>/dev/null
      wait $TUNER_PID 2>/dev/null
    4. Press Esc, type :wq!, and press Enter to save the file and exit.
    5. Grant execute permission on the script.
      chmod 777 memory.sh
  8. Execute the script to analyze the memory access statistics of the application.
    ./memory.sh
    Figure 3 Memory access statistics analysis report

    According to the analysis report, the reference bottleneck value for DDRC read bandwidth provided by the tool is 12,500 MB/s, while the actual DDRC read bandwidth of node 0 exceeds 50,000 MB/s. If the DDRC read bandwidth exceeds the bottleneck value during application execution, the code related to memory reads should be optimized. It is recommended to use the miss event analysis function of the System Profiler to check whether the memory hit ratio is low.

  9. Analyze miss events of the application.
    ./devkit tuner miss /home/demo/ddrc_before
    Figure 4 Miss event analysis report

    According to the analysis report, the LLC miss rate of the MemoryBoundBench function is abnormally high. Based on the high DDRC read bandwidth observed in the analysis, check whether the source code has any of the following typical causes of a low cache hit ratio:

    1. A large number of randomly accessed arrays or linked lists exist.
    2. Frequent memory copying or serialization occurs.
    3. The algorithm lacks locality, resulting in cache misses.
  10. Check the source file to identify the issue in the code.
    vim /home/demo/memory_bound.c
    Figure 5 Source file

    The source code performs a large number of random data accesses, and the access range is not constrained. As a result, memory accesses are highly scattered, which can cause cross-die and cross-chip accesses and a low cache hit ratio.