Using Hotspot Function Analysis

Use the hotspot function analysis of the System Profiler to determine which hotspot functions have performance bottlenecks.

High Percentage of LDR Instructions

Enable CPU prefetch.

When the CPU reads data from the memory to the cache, the CPU reads the data to be accessed this time and also prefetches adjacent data to the cache. If the prefetched data is the data to be accessed next time, the performance is improved; if not, the memory bandwidth is wasted.

In scenarios where data is centralized, the prefetch hit rate is high and CPU prefetch is recommended. If data is not centralized and the prefetch hit rate is low, the memory bandwidth will be wasted.

Function Inlining

Inline frequently called small functions. Function inlining is a typical space-for-time optimization method.

Reference: https://www.hikunpeng.com/document/detail/en/perftuning/tuningtip/kunpengtuning_12_0085.html
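As an illustration of the tip above (the function names are hypothetical), marking a small, frequently called helper as static inline invites the compiler to paste its body into each call site, trading code size for the removed call/return overhead:

```c
#include <stdint.h>

/* Small, frequently called helper: "static inline" lets the compiler
   expand the body at each call site instead of emitting a call. */
static inline uint32_t clamp_u8(uint32_t v) {
    return v > 255u ? 255u : v;
}

/* Hot loop: after inlining, the compare/select runs directly in the
   loop body with no per-element function-call overhead. */
void saturate(uint32_t *dst, const uint32_t *src, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = clamp_u8(src[i]);
}
```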

NEON Instruction Acceleration

Tune NEON instructions by using vectorization.

Reference: https://www.hikunpeng.com/document/detail/en/perftuning/tuningtip/kunpengtuning_12_0053.html
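As an illustration (not taken from the referenced guide), a loop written with unit stride, restrict-qualified pointers, and no internal branches gives the compiler what it needs to vectorize automatically. On AArch64, GCC at -O3 typically lowers such a loop to NEON loads and fused multiply-adds; NEON intrinsics from arm_neon.h can be used manually where auto-vectorization falls short.

```c
/* Vectorization-friendly AXPY: unit-stride access, "restrict" to rule
   out aliasing, no branches in the body. Compile with -O3 (or
   -O2 -ftree-vectorize) and inspect the report with -fopt-info-vec. */
void axpy(float *restrict y, const float *restrict x, float a, int n) {
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}
```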

Loop Optimization

Loop optimization restructures the loops in a program. It helps to utilize the processor's computing units, improve the scheduling efficiency of the instruction pipeline, and increase the cache hit ratio. Common loop optimization methods include loop unrolling, loop fusion, loop fission, loop interchange, and loop tiling.

  • Loop unrolling

    Loop unrolling duplicates the loop body several times to reduce the number of times the loop condition is tested. Since the Kunpeng processor has multiple instruction execution units, increasing the computing density per iteration improves the scheduling efficiency of the instruction pipeline.

    Small loops without internal branch logic gain the most. Large loops may exhaust the general-purpose registers and force spills to memory (register renaming/spilling), compromising performance, and internal branch logic may increase branch prediction overhead, so each case requires separate analysis.

  • Loop fusion

    Loop fusion combines the bodies of adjacent loops that iterate over the same range, reducing operations on loop variables and reusing data in the loop body while it is still in cache. Fusing small loops also increases the opportunity for out-of-order execution on the TaiShan processor and improves the scheduling efficiency of the instruction pipeline.

  • Loop fission

    Loop fission breaks a complex or compute-intensive loop into multiple smaller loops to improve register utilization.

  • Loop interchange

    Loop interchange exchanges the nesting order of loops so that memory accesses follow the principle of locality, improving the cache hit ratio.

  • Loop tiling

    Loop tiling reorganizes a loop into a group of nested loops, with each inner loop processing a small data block. This reuses data already present in the cache and improves the cache hit ratio. The method is usually applied to large datasets.

Reference: https://www.hikunpeng.com/document/detail/en/perftuning/tuningtip/kunpengtuning_12_0083.html
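The loop techniques above can be sketched in C. These are illustrative fragments rather than code from the referenced guide, and the tile size is an assumed value to tune per cache size.

```c
enum { N = 64, TILE = 8 };   /* illustrative sizes; tune per cache */

/* Loop unrolling: four independent accumulators reduce the number of
   loop-condition tests and keep more adds in flight across the
   processor's multiple execution units. */
long sum_unrolled(const int *a, int n) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]; s1 += a[i + 1]; s2 += a[i + 2]; s3 += a[i + 3];
    }
    for (; i < n; i++)            /* remainder loop */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}

/* Loop fusion: one traversal computes both results while a[i] is
   still in cache, paying the loop-control overhead only once. */
void square_and_double(const int *a, int *sq, int *dbl, int n) {
    for (int i = 0; i < n; i++) {
        sq[i]  = a[i] * a[i];
        dbl[i] = a[i] * 2;
    }
}

/* Loop interchange: C arrays are row-major, so keeping j innermost
   walks memory with unit stride and fully uses each fetched line. */
long sum_matrix(int a[N][N]) {
    long s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Loop tiling: transpose in TILE x TILE blocks so the source and
   destination lines stay resident in L1 while a block is processed. */
void transpose_tiled(const double *src, double *dst, int n) {
    for (int ii = 0; ii < n; ii += TILE)
        for (int jj = 0; jj < n; jj += TILE)
            for (int i = ii; i < ii + TILE && i < n; i++)
                for (int j = jj; j < jj + TILE && j < n; j++)
                    dst[j * n + i] = src[i * n + j];
}
```

As always with loop transformations, measure before and after: modern GCC already performs some of these at -O2/-O3, and a manual transformation can fight the compiler's own.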

Branch Optimization

  • Optimizing branch predictions

    Make full use of compiler branch-prediction hints such as the likely() and unlikely() macros to increase the accuracy of conditional branch predictions and shorten the instruction running path.

  • Optimizing logical expressions
    Table 1 Logical expression optimization solutions

    Scenario: if ((i < 4) || (i & 1)) { ... }
    Solution: If i is more likely to be odd, testing the cheap bit operation first performs better: if ((i & 1) || (i < 4)) { ... }.

    Scenario: if ((strlen(p) > 4) && (*p == 'y')) { ... }
    Solution: If *p is unlikely to be 'y', the strlen() call is wasted work. Test the cheap comparison first: if ((*p == 'y') && (strlen(p) > 4)) { ... }.

    Scenario: if (a <= max && a >= min && b <= max && b >= min)
    Solution: If most of the data falls outside the range, invert the test so it short-circuits early: if (a > max || a < min || b > max || b < min).

  • Replacing if-else with switch for multiple branches

    For multi-branch conditionals, prefer a switch statement over an if-else chain: the result is clearer, and the compiler can often generate a jump table, making it more efficient as well.
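The branch tips above can be sketched as follows. The likely()/unlikely() macros are conventionally defined over GCC's __builtin_expect (as in the Linux kernel); the function names and case values are illustrative.

```c
/* Branch-prediction hints: tell the compiler which arm to lay out as
   the fall-through (hot) path and which to move out of line. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int checked_byte(int c) {
    if (unlikely(c < 0))        /* cold error path */
        return -1;
    return c & 0xff;            /* hot path falls through */
}

/* Multi-way branch as a switch: with dense case values the compiler
   can emit a jump table instead of a chain of compares. */
int opcode_cost(int op) {
    switch (op) {
    case 0: return 1;
    case 1: return 1;
    case 2: return 3;
    case 3: return 10;
    default: return -1;
    }
}
```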

Cache Optimization

  • Cache line alignment

    Reference: https://www.hikunpeng.com/document/detail/en/perftuning/tuningtip/kunpengtuning_12_0052.html

  • Eliminating false sharing

    False sharing occurs when multiple CPUs modify their own private variables that happen to reside in the same cache line, invalidating that cache line for the other CPUs and forcing them to refresh it. The effect resembles contention on a shared variable, but the variables are not truly shared, hence the name false sharing. For details, see False sharing.

    Figure 1 False sharing

    CPU0's and CPU1's private variables happen to be located in the same cache line (shown as the red and blue blocks, respectively). When CPU0 modifies its private variable, the entire cache line is invalidated, and CPU1 must fetch the line from memory again to access its own variable, which reduces efficiency.

    Solution:

    • In OpenMP code, use the reduction clause instead of writing directly to shared variables (writing thread private variables during the loop).
    • Align private variables of threads by cache line size (except for variables on the thread stack).
    • Use thread private variables (for example, GCC supports the __thread keyword, while C11 supports the _Thread_local keyword).
  • Data rearrangement

    Data rearrangement is the process of turning physically discontiguous hotspot data into contiguous data, allowing CPU access by cache line and improving cache hit ratio. For example, in matrix multiplication, assuming that the matrix is stored by rows, the column elements of matrix B are read discontinuously and the cache hit rate is low. (For ease of understanding, the following figure assumes that the total size of the row/column elements in matrix B is the same as the size of the cache block.)

    Figure 2 Matrix multiplication 1

    The cache hit ratio is improved by rearranging matrix B. The column elements can be read continuously from the L1 cache.

    Figure 3 Matrix multiplication 2
  • Using software prefetch

    Software prefetch means loading data that will be used later into the cache in advance with prefetch memory (PRFM) instructions, avoiding the extra memory access latency caused by cache misses. As shown in Figure 4, the data at addr2 is prefetched in advance, so that after addr1 is processed, the data at addr2 is ready.

    Figure 4 Software prefetch

    The __builtin_prefetch() function can be used in GCC. The function prototype is __builtin_prefetch(const void *addr, int rw, int locality), where addr is the address of the data to prefetch, and rw indicates the operation to be performed on the cache line at addr (0 indicates read, which is the default; 1 indicates write). locality indicates how frequently the cache line will be accessed after the prefetch and can be set to 0, 1, 2, or 3: 0 means the cache line is accessed only once and should not reside in the cache, while 3 means the cache line will be accessed frequently and should reside in caches of all levels.
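The false-sharing and prefetch tips above can be sketched together. The 64-byte line size is the typical value (including on Kunpeng) but is an assumption to confirm for the target CPU, and the prefetch distance of one node is likewise an assumed starting point to tune.

```c
#include <stdint.h>

/* Eliminating false sharing: pad each thread's counter to its own
   64-byte cache line so updates from different cores never touch
   the same line. */
struct padded_counter {
    long value;
} __attribute__((aligned(64)));

struct padded_counter counters[4];      /* one slot per thread */

/* Software prefetch: while walking a linked list, request the next
   node ahead of time. __builtin_prefetch compiles to PRFM on AArch64
   and to nothing on targets without prefetch support. */
struct node { struct node *next; long payload; };

long sum_list(const struct node *n) {
    long s = 0;
    while (n) {
        if (n->next)
            __builtin_prefetch(n->next, 0, 3);  /* read, keep resident */
        s += n->payload;
        n = n->next;
    }
    return s;
}
```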

Structure Optimization

  • Byte alignment

    Consider alignment first. For example, if the data width of the bus is 64 bits (8-byte alignment), with a burst size of 64 bits and a burst length of 4, writing 32 bytes to address 0xf0002000 requires only one transfer, whereas writing 32 bytes to 0xf0002001 must be split into two unaligned transfers, which is clearly less efficient than a single transfer.

    Figure 5 Byte alignment

    Although Kunpeng CPUs support unaligned accesses, misaligned layouts for structures that are accessed frequently make reads and writes inefficient, and different load/store instructions have different alignment requirements. An example of controlling byte alignment in GCC follows; without attributes, GCC aligns members naturally, up to 8 bytes.

    struct person1 {               /* packed: no padding between members */
        char *name;
        int age;
        char score;
        int id;
    } __attribute__((packed));

    struct person2 {               /* aligned(4): struct alignment of at least 4 bytes */
        char *name;
        int age;
        char score;
        int id;
    } __attribute__((aligned(4)));
    Figure 6 GCC byte alignment

    Keeping the structure byte aligned makes the CPU read data from memory to cache with lower latency.

  • Member sequence adjustment

    For a large structure, if two members in the structure are widely spaced across two cache lines, and a hotspot function needs to access these two members frequently, it may cause a large number of cache misses. (Modifying one member may cause the cache line where the other member is located to be replaced out.) You can put the two members in the same cache line to improve the cache utilization.
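The two structure tips above can be made concrete. The sizes assume an LP64 platform (8-byte pointers), and the record structure is a hypothetical example, not code from the guide.

```c
#include <stddef.h>

/* Byte alignment, reusing the person layout from the text: packing
   removes all padding (smallest size, possibly misaligned fields),
   while the default layout pads each member to natural alignment. */
struct person_packed {
    char *name; int age; char score; int id;
} __attribute__((packed));          /* 8 + 4 + 1 + 4 = 17 bytes */

struct person_default {
    char *name; int age; char score; int id;
};                                  /* padding grows it to 24 bytes */

/* Member sequence adjustment: the two members a hotspot function
   touches together are placed adjacently at the front so they share
   one 64-byte cache line instead of straddling two. */
struct record {
    long hits;                      /* hot */
    long last_seen;                 /* hot: same cache line as hits */
    char description[240];          /* cold bulk data afterwards */
};
```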

Reducing Unnecessary Barriers

Modern CPUs generally execute instructions out of order to keep the pipeline busy, so the order of instructions in the code may differ from the order in which they actually execute. When two instructions have no register dependency, the CPU may execute them in either order. In some scenarios, however, the two instructions must be ordered: for example, after writing a value to a memory address, a program may need to wait until the write completes (writing to memory takes time) before writing to a peripheral register. Because the register address has no data dependency on the memory address, the CPU cannot guarantee the order on its own, so barrier instructions are provided for the programmer to enforce ordering manually.

Figure 7 Manual order-preserving

Armv8 provides three barrier instructions: DMB, DSB, and ISB.

DMB: data memory barrier, which enforces ordering between memory accesses before and after it.

DSB: data synchronization barrier, which is stricter than DMB and waits for all outstanding memory accesses to complete.

ISB: instruction synchronization barrier, which flushes the pipeline so that subsequent instructions are refetched.

ISB has the highest execution cost, followed by DSB and then DMB. From a performance perspective, avoid barriers where possible, and where a barrier is needed, prefer DMB over DSB or ISB. Note that Armv8 also adds Load-Acquire/Store-Release instructions, which carry implicit barrier semantics.
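One way to get the cheapest sufficient barrier is to express ordering through C11 fences and let the compiler choose the instruction: on AArch64, a release fence becomes "dmb ish" rather than the costlier DSB or ISB. A minimal publish sketch (variable names are illustrative):

```c
#include <stdatomic.h>

int data;            /* payload published by one thread            */
atomic_int ready;    /* flag observed by another                    */

void publish(int v) {
    data = v;                                   /* plain store       */
    atomic_thread_fence(memory_order_release);  /* order it first:
                                                   DMB on AArch64    */
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}
```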

Memory Access Optimization

  • Reducing unnecessary memory reads, writes and allocations

    In the second function shown in Figure 8, the hash variables are treated as temporaries: they do not need to be saved on the stack and can be returned directly at the end of the computation, which removes unnecessary memory reads and writes.

    Figure 8 Memory access optimization

    In addition, memory allocation involves large overheads. For frequently used memory, consider creating a memory pool at initialization time and subsequently fetching memory blocks directly from the memory pool. Special care should be taken to avoid unnecessary allocation of memory in the loop.

  • Do not frequently dereference pointers to obtain data; prefer local stack variables.
    Figure 9 Preferentially using local stack variables
  • Proper use of global variables

    When global variables are used, the compiler cannot keep them in registers across function boundaries, because another function may read or modify them, so each access becomes a load or store. Local variables, by contrast, can be promoted to registers during compiler optimization, eliminating redundant load operations.

    Solution:

    • Use local variables unless necessary.
    • In scenarios where global variables need to be read by multiple functions but not modified, pass them in as formal parameters with the const keyword to prevent them from being modified during the function call.
    • If global variables are to be called by other modules, wrap them in get/set form to avoid using global variables directly.
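For the memory-pool advice above, a minimal fixed-size-block pool can look like the following sketch. The block and pool sizes are assumed values, and the code is deliberately simple: single-threaded, one block size, no bounds checks.

```c
#include <stddef.h>

/* Minimal fixed-size-block pool: the arena is carved into blocks at
   init, and steady-state alloc/free is O(1) free-list manipulation,
   with no malloc calls inside hot loops. Not thread-safe. */
enum { BLOCK_SIZE = 64, NBLOCKS = 128 };   /* assumed sizes */

static unsigned char arena[BLOCK_SIZE * NBLOCKS]
    __attribute__((aligned(64)));
static void *free_list;

void pool_init(void) {
    free_list = NULL;
    for (int i = NBLOCKS - 1; i >= 0; i--) {
        void **blk = (void **)&arena[i * BLOCK_SIZE];
        *blk = free_list;          /* thread block onto the free list */
        free_list = blk;
    }
}

void *pool_alloc(void) {
    void **blk = free_list;
    if (!blk)
        return NULL;               /* pool exhausted */
    free_list = *blk;
    return blk;
}

void pool_free(void *p) {
    void **blk = p;
    *blk = free_list;              /* push back in O(1) */
    free_list = blk;
}
```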

Multi-core Optimization

  • Lock optimization

    Load-link/store-conditional (LL/SC) atomic instructions load a shared variable into the L1 cache of the current core and modify it there. Performance is good when lock contention is light, but deteriorates severely when contention is intense. The Armv8.1 specification introduces a new atomic instruction extension, Large System Extensions (LSE), which performs the operations in the L3 cache, widening the scope of data sharing, reducing cache-coherence overhead, and improving lock performance under intense contention.

    In the case of multiple cores and severe atomic lock contention, add the LSE option to the GCC compilation options to ease lock contention.

    Figure 10 LL/SC instructions (ldaxr and stlxr)
    Figure 11 LSE instruction (ldaddal)

    Solution:

    If GCC 6.0 or later is used (GCC 7.3.0 or later is recommended), add the -march=armv8-a+lse, -march=armv8.1-a, or -march=armv8.2-a option to the compilation options.

  • Reducing unnecessary barriers

    Armv8 provides Load-Acquire (LDAR) and Store-Release (STLR) instructions, which include implicit barrier semantics, so avoid adding unnecessary standalone barriers when implementing atomic operations.

    Figure 12 One-way barriers
    Figure 13 Load-Acquire/Store-Release instruction
  • Reducing cross-NUMA invoking

    The Kunpeng processor uses a Lego-style architecture with two NUMA nodes inside each processor socket. The distance between the physical core where a thread or process actually runs and the NUMA node that holds its memory determines the memory access latency: local access is fastest, cross-NUMA access within the same socket is slightly slower, and cross-socket access is the slowest.

    Figure 14 NUMA node configuration

    Solution:

    • Avoid thread migration during runtime: when using OpenMP, bind threads to CPU cores by setting the environment variables OMP_PROC_BIND=true and OMP_PLACES.
      Figure 15 Binding threads to CPU cores
    • Use numactl, taskset, cgroups, or cpuset to bind processes and threads to CPU cores.
      Figure 16 NUMA core binding
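The lock and barrier advice earlier in this section can be sketched with C11 atomics (variable and function names are illustrative). With -march=armv8-a+lse, GCC compiles the fetch-add below to a single LDADDAL instruction instead of an LDAXR/STLXR retry loop, and the acquire/release pair maps to LDAR/STLR with no separate DMB.

```c
#include <stdatomic.h>

/* Contended counter: under LSE this is one LDADDAL, which scales far
   better under heavy contention than an exclusive-load retry loop. */
atomic_long hits;

long count_hit(void) {
    return atomic_fetch_add_explicit(&hits, 1, memory_order_seq_cst);
}

/* Acquire/release hand-off: ordering is built into the instructions
   (STLR for the store, LDAR for the load), so no extra barrier is
   needed around the flag. */
int message;
atomic_int flag;

void producer(int v) {
    message = v;
    atomic_store_explicit(&flag, 1, memory_order_release);     /* STLR */
}

int consumer(void) {
    while (!atomic_load_explicit(&flag, memory_order_acquire)) /* LDAR */
        ;
    return message;
}
```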