Performance Analysis Process

Prerequisites

- The server and OS are running properly.
- An SSH remote login tool has been installed on the local PC.
- The Kunpeng Performance Boundary Analyzer and System Profiler have been installed in the target environment and are running properly.

Procedure

Download the cpu_branch_prediction_before.cpp file from GitHub, upload it to the /home/demo directory, and run the following command to switch to the source code directory:
```
cd /home/demo
```

Compile the source file.

```
g++ -o cpu_branch_prediction_before cpu_branch_prediction_before.cpp
```

View the application runtime.
```
time /home/demo/cpu_branch_prediction_before
```
The command output shows that the application runs for 61 seconds. This runtime serves as the baseline for measuring the effect of later optimizations. Next, use the Kunpeng Performance Boundary Analyzer to check the application for performance issues. If any optimization opportunities are found, update the source code accordingly.

Figure 1 Runtime
Use the Kunpeng Performance Boundary Analyzer to locate issues.
Go to the installation directory of the Kunpeng Performance Boundary Analyzer. Replace x.x.x in the command with the actual version.
```
cd /home/ksys-x.x.x-Linux-aarch64
```
Collect the application performance data.
```
./ksys collect /home/demo/cpu_branch_prediction_before
```
/home/demo/cpu_branch_prediction_before indicates the application whose data is to be collected.

Figure 2 Microarchitecture statistics

In the microarchitecture statistics, the value of Branch Mispredicts(%) under Bad Speculation(%) is high. This metric reflects the branch prediction errors that occur on the CPU during program execution. Each misprediction forces the CPU pipeline to discard the instructions it executed speculatively on the wrong path, which introduces significant time overhead. The higher the Branch Mispredicts value, the more branch mispredictions occur, the greater the CPU overhead, and the poorer the performance. The Top-Down metrics collected by the Kunpeng Performance Boundary Analyzer are a subset of those provided by the microarchitecture analysis function of the System Profiler, so use the microarchitecture analysis function when you need more detailed metrics.
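To make this overhead concrete, a rough first-order estimate (an assumption for illustration, not how either tool computes its metrics) multiplies the number of mispredicted branches N_miss by an average penalty C_penalty in cycles and divides by the core frequency f. The example values (10^9 mispredictions, a 15-cycle penalty, 2.6 GHz) are hypothetical and are not measurements of this application; the real penalty varies by microarchitecture, and overlapping work can hide part of it.

```latex
T_{\text{overhead}} \approx \frac{N_{\text{miss}} \times C_{\text{penalty}}}{f}
\qquad\text{e.g.}\qquad
\frac{10^{9} \times 15\ \text{cycles}}{2.6 \times 10^{9}\ \text{Hz}} \approx 5.8\ \text{s}
```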
Use the System Profiler to further analyze the program.
Go to the installation directory of the System Profiler. Replace x.x.x in the command with the actual version.
```
cd /home/DevKit-Tuner-CLI-x.x.x-Linux-Kunpeng
```
Use the System Profiler to analyze the microarchitecture of the application.
```
./devkit tuner top-down -d 30 /home/demo/cpu_branch_prediction_before
```
- -d 30 indicates that the collection duration is 30 seconds.
- /home/demo/cpu_branch_prediction_before indicates the application whose data is to be collected.

Figure 3 Microarchitecture analysis report

Check the microarchitecture analysis report. The Branch Mispredicts value is high, which is consistent with the results from the Kunpeng Performance Boundary Analyzer, and Other Branch accounts for a large proportion of it. This indicates that most of the CPU branch mispredictions come from direct conditional and unconditional jumps, so focus on such jumps in the source code.
The branch types and the source code constructs that commonly produce them are as follows:
- Indirect Branch: The jump target address is not encoded in the instruction as an immediate but is read from a register or memory. It is commonly used for virtual functions, function pointers, and switch jump tables.
- Push Branch: A call instruction, which is call on x86 and bl on Arm. It is commonly used for function calls.
- Pop Branch: A return instruction, which is ret on both x86 and Arm. It is commonly used for function returns.
- Other Branch: Conditional and unconditional direct jumps. They are commonly generated by if-else, for, while, and do-while statements, ternary operators, and break, continue, and goto statements.
Check the source file to identify the issue in the code.
```
vim /home/demo/cpu_branch_prediction_before.cpp
```
Figure 4 Source file

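In the source code, an if-else check is performed on each generated random number. Because the numbers are not sorted, the outcome of the check changes unpredictably from one element to the next, so the branch predictor cannot learn a stable pattern and frequently mispredicts. Each misprediction forces the CPU to discard the speculatively executed instructions and refetch from the correct path, which increases the runtime.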
