Performance Analysis Process

Prerequisites

  • The server and OS are running properly.
  • An SSH remote login tool has been installed on the local PC.
  • The Kunpeng Performance Boundary Analyzer and System Profiler have been installed in the target environment and are running properly.

Procedure

  1. Download the hotspot_io_before.cc file from GitHub, upload it to the /home/demo directory, and run the following command to switch to the source code directory:
    cd /home/demo
  2. Compile the source file.
    g++ -g -O2 -std=c++17 /home/demo/hotspot_io_before.cc -o /home/demo/hotspot_io_before
  3. Use random numbers to generate a large file.
    dd if=/dev/urandom of=tmp.txt bs=1M count=4096

    A 4 GiB tmp.txt file (4096 blocks of 1 MiB each) is generated in the current directory.

  4. View the application runtime.
    time ./hotspot_io_before tmp.txt

    The overall application runtime is 16.3 seconds: 4.1 seconds spent on data reading and 12.2 seconds on data processing. Data reading therefore accounts for 25.15% of the total runtime (4.1 / 16.3). Because the application loads all of its data from a file, this suggests that I/O operations are a potential performance bottleneck.

    Figure 1 Runtime
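
    The share quoted above is plain arithmetic on the two phase times. As a minimal sketch (the function name phase_share_percent is hypothetical and not part of hotspot_io_before.cc), it can be reproduced as follows:

```cpp
#include <cassert>

// Hypothetical helper: percentage of the total runtime spent in one phase.
// Both arguments are in seconds; total_seconds must be positive.
double phase_share_percent(double phase_seconds, double total_seconds) {
    assert(total_seconds > 0.0);
    return 100.0 * phase_seconds / total_seconds;
}
```

    Calling phase_share_percent(4.1, 4.1 + 12.2) gives roughly 25.15, the read share reported in this step.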
  5. Use the Kunpeng Performance Boundary Analyzer to locate issues.
    Go to the installation directory of the Kunpeng Performance Boundary Analyzer. Replace x.x.x in the command with the actual version.
    cd /home/ksys-x.x.x-Linux-aarch64
  6. Edit the config.yaml file to enable hotspot function analysis.
    1. Open the config.yaml file.
      vim config.yaml
    2. Press i to enter the insert mode.
    3. Change the default value of enabled in the hotspot field from false to true.
    4. Press Esc, type :wq!, and press Enter to save the file and exit.
  7. Collect the application performance data.
    ./ksys collect -d 10 /home/demo/hotspot_io_before /home/demo/tmp.txt
    • -d 10 indicates that the collection duration is 10 seconds.
    • /home/demo/hotspot_io_before /home/demo/tmp.txt indicates the application whose data is to be collected and the file parameters on which the application depends.
    Figure 2 Hotspot statistics

    In the hotspot statistics, the main user-mode function process_buffer accounts for only 51% of total calls, whereas the kernel-mode function __arch_copy_to_user accounts for a comparatively high 29%. This kernel function is reached from the read_file_to_buffer function in the user-mode application, and its call stack matches the characteristics of read system calls on the Arm platform. It can therefore be inferred that read_file_to_buffer issues frequent system calls. As a result, with a fixed user-mode runtime, kernel-mode overhead is relatively high, indicating potential room for optimization.
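
    To illustrate why frequent read system calls add kernel-mode overhead, the following is a minimal sketch of a chunked read loop that counts how many read() calls it issues. It is not the code of read_file_to_buffer, which is not shown here; the names ReadStats and read_in_chunks and the chunk sizes are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <unistd.h>
#include <vector>

struct ReadStats {
    std::size_t bytes = 0;       // total bytes read from the file
    std::size_t read_calls = 0;  // number of read() system calls issued
};

// Hypothetical stand-in for a read loop: reads the whole file in
// chunk_size pieces. Each read() copies data from the kernel page cache
// into the user buffer, which is where copy-to-user time is spent.
ReadStats read_in_chunks(const char* path, std::size_t chunk_size) {
    assert(chunk_size > 0);
    ReadStats stats;
    int fd = open(path, O_RDONLY);
    if (fd < 0) return stats;  // on failure the caller sees 0 bytes
    std::vector<char> buf(chunk_size);
    for (;;) {
        ssize_t n = read(fd, buf.data(), buf.size());
        ++stats.read_calls;
        if (n <= 0) break;  // 0 means end of file, negative means error
        stats.bytes += static_cast<std::size_t>(n);
    }
    close(fd);
    return stats;
}
```

    For a regular file, the number of calls is typically about the file size divided by the chunk size, plus one final call that returns 0. Under that assumption, reading a 4 GiB file in 4 KiB chunks takes roughly a million read() calls, whereas 1 MiB chunks take about 4,096. The sketch does not retry on EINTR.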

    For functions whose share of execution time looks disproportionate, you can use the flame graphs in the hotspot analysis function of the System Profiler to examine the call characteristics and determine whether any call stacks can be optimized.

  8. Use the System Profiler to further analyze the application.
    Switch to the installation directory of the System Profiler. Replace x.x.x in the command with the actual version.
    cd /home/DevKit-Tuner-CLI-x.x.x-Linux-Kunpeng
  9. Use the System Profiler to analyze hotspot functions of the application.
    ./devkit tuner hotspot --package -g -d 10 /home/demo/hotspot_io_before /home/demo/tmp.txt
    • --package generates a report data package. If you do not set the package name or path, a hotspot-timestamp.tar package is generated in the current directory by default.
    • -g displays the call stack information and generates an HTML flame graph file. By default, a Flamegraph-Timestamp.html file is generated in the current directory.
    • -d 10 indicates that the collection duration is 10 seconds.
    • /home/demo/hotspot_io_before /home/demo/tmp.txt indicates the application whose data is to be collected and the file parameters on which the application depends.

    The hotspot function analysis report shows that the two hotspot functions account for 61% and 17% of the total calls, respectively, which indicates excessive I/O system call activity.

    Figure 3 Hotspot function analysis
  10. View the generated flame graph file.

    The report shows that a flame graph file whose name starts with Flamegraph is generated in the /home/DevKit-Tuner-CLI-x.x.x-Linux-Kunpeng directory. In the flame graph, 62% of the calls correspond to the user-mode computation function process_buffer, and the remaining 38% involve system calls. The 38% portion splits into read system calls on the left side and memory-related system calls on the right side. Based on this analysis, the potential optimization targets are:

    1. Read system calls
    2. Memory-related system calls

    Determine the specific optimization solution based on the source code.

    Figure 4 Flame graph
  11. Check the source file to identify the issue in the code.
    vim /home/demo/hotspot_io_before.cc
    Figure 5 Source file

    In the source code, the read_file_to_buffer function reads data using the traditional read/write mode, in which the proportion of time spent on I/O is generally high for large files. The process_buffer function only processes the data after it has been read. The data reading path is therefore the feasible target for optimization.
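
    One commonly used way to reduce read system calls and copies into a user buffer is to memory-map the file. The following is only a sketch of that general technique, under the assumption that the data can be processed in place; it is not confirmed here as the optimization applied to hotspot_io_before.cc, and the helper name sum_bytes_mmap and its byte-sum workload are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: sums all bytes of a file through a read-only mapping
// instead of copying them into a user buffer with read().
// Returns false if the file cannot be opened, inspected, or mapped.
bool sum_bytes_mmap(const char* path, std::uint64_t& sum_out) {
    assert(path != nullptr);
    sum_out = 0;
    int fd = open(path, O_RDONLY);
    if (fd < 0) return false;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return false; }
    if (st.st_size == 0) { close(fd); return true; }  // mmap rejects length 0
    std::size_t len = static_cast<std::size_t>(st.st_size);
    void* addr = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping remains valid after the descriptor is closed
    if (addr == MAP_FAILED) return false;
    const unsigned char* p = static_cast<const unsigned char*>(addr);
    for (std::size_t i = 0; i < len; ++i) sum_out += p[i];  // process in place
    munmap(addr, len);
    return true;
}
```

    With a mapping, pages are faulted in on access rather than copied by explicit read() calls, so the cost moves from system calls to page faults. Whether this is faster for a given workload should be confirmed by repeating the measurements in this procedure.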