Sample 2: Detecting and Tuning Column-wise Access Loops
Introduction

A two-dimensional array is stored row by row (row-major order): X00, X01, X02, X03, ..., X10, X11, X12, X13, ... When the processor loads data from memory, it fetches a fixed-length block (a 64-byte cache line) rather than a single element. For example, loading X00 also brings X01, X02, ... into the cache, so a later access to X01 is served from the cache instead of from memory. Reading the array row by row therefore reduces memory access time and improves performance. When the array is read column by column, the cache line loaded for X00 may not contain X10, so the access to X10 has to go to memory.
Environment Preparations
- Check whether a compatible OS is installed on the server and the GCC version is 7.3.0 or later. Use the Kunpeng DevKit Compatibility Checker to view the details.
- Check that the Kunpeng DevKit System Profiler has been installed on the server.
- Download the code samples from GitHub. The sample code files are cache_hit.c and cache_miss.c. Run the following command to grant read, write, and execute permissions to all users:
chmod 777 cache_hit.c cache_miss.c
Analyzing Array Access in the cache_miss Program
- Prepare the program.
Compile cache_miss.c and grant the read, write, and execute permissions to all users.
gcc -g cache_miss.c -o cache_miss && chmod 777 cache_miss
- Use the hotspot function analysis to analyze the cache_miss program and locate hotspot functions and instructions.
Click the icon next to System Profiler and select General analysis. On the task creation page that is displayed, select Hotspot Function, set the required parameters, and click OK to start the hotspot function analysis task.
Figure 1 Creating a hotspot function analysis task
Table 1 Task parameters

| Parameter | Description |
| --- | --- |
| Analysis Type | Set it to Hotspot functions. |
| Analysis Object | Set it to Application. This sample analyzes the cache_miss program. |
| Mode | Set it to Launch application. |
| Application Path | Enter the absolute path of the application. In this sample, the path is /opt/testdemo/cache/cache_miss/cache_miss on the server. In this path, the first cache_miss is a folder and the second cache_miss is the executable program. |
| Sampling Duration (s) | Set it to 30. |
| Call Stack | Enable this option. |
| Sampling Range | Set it to User mode. The available options are user mode, kernel mode, and all. In this sample, all CPU resources are consumed in user mode, so select User mode. |
| dwarf | Enable this option. |
| C/C++ Source File Directory | Directory of the source code to associate with the collected data. Example: /opt/testdemo/cache/cache_miss/ |
| Other Parameters | Retain their default values. |
- View the analysis results.
Figure 2 Summary of the hotspot function analysis result
The hotspot function list shows that the main function of the cache_miss program accounts for all clock cycles. Click a function name shown in blue to view the corresponding lines of that function in the source code.
Optimization Solution
- Prepare the program.
Compile cache_hit.c and grant the read, write, and execute permissions to all users.
gcc -g cache_hit.c -o cache_hit && chmod 777 cache_hit
- Use the hotspot function analysis to analyze the program.
Click the icon next to System Profiler and select General analysis. On the task creation page that is displayed, select Hotspot Function, set the required parameters, and click OK to start the hotspot function analysis task.
Figure 3 Creating a hotspot function analysis task
Table 2 Task parameters

| Parameter | Description |
| --- | --- |
| Analysis Type | Set it to Hotspot functions. |
| Analysis Object | Set it to Application. |
| Mode | Set it to Launch application. |
| Application Path | Enter the absolute path of the application. In this sample, the path is /opt/testdemo/cache/cache_hit/cache_hit on the server. In this path, the first cache_hit is a folder and the second cache_hit is the executable program. |
| Sampling Duration (s) | Set it to 30. |
| Call Stack | Enable this option. |
| Sampling Range | Set it to User mode. The available options are user mode, kernel mode, and all. In this sample, all CPU resources are consumed in user mode, so select User mode. |
| dwarf | Enable this option. |
| C/C++ Source File Directory | Directory of the source code to associate with the collected data. Example: /opt/testdemo/cache/cache_hit/ |
| Other Parameters | Retain their default values. |
- View the analysis results.
Figure 4 Summary of the hotspot function analysis result
Compare the hotspot function analysis results of the cache_hit and cache_miss programs. The running time and the number of clock cycles of the optimized program (cache_hit) are far lower than those of the program before tuning (cache_miss).
These values are derived from sampling data and may vary in different environments.
Result Analysis
After the loop body is optimized to access the array row by row, the execution time of the cache_hit program is significantly lower than that of the cache_miss program.