High Cache Miss Rate
Use the micro-architecture analysis of the System Profiler to identify the processes and code that cause a high cache miss rate.
Cache Optimization
- Cache line alignment
Reference: https://www.hikunpeng.com/document/detail/en/perftuning/tuningtip/kunpengtuning_12_0052.html
- Eliminating false sharing
False sharing occurs when multiple CPUs modify their private variables stored in the same cache line, causing the cache line to be invalidated for the other CPUs and forcing them to reload it. This behaves like contention over a shared variable, but the variables are not truly shared, hence the name false sharing. See Figure 1.
CPU0's and CPU1's private variables happen to be located in the same cache line (the private variables correspond to the red and blue blocks respectively). When CPU0 modifies its private variable, the entire cache line is invalidated, and CPU1 must read its private variable from memory again before accessing it, which reduces efficiency.
Solution:
- In OpenMP code, use the reduction clause instead of writing directly to shared variables (the reduction writes to thread-private variables during the loop).
- Align each thread's private variables to the cache line size (except for variables on the thread stack).
- Use thread private variables (for example, GCC supports the __thread keyword, while C11 supports the _Thread_local keyword).
- Data rearrangement
Data rearrangement turns physically discontiguous hotspot data into contiguous data, so that the CPU can fetch it cache line by cache line, improving the cache hit ratio. For example, in matrix multiplication, assuming that matrices are stored by rows, the column elements of matrix B are read discontiguously and the cache hit rate is low. (For ease of understanding, the following figure assumes that the total size of the row/column elements of matrix B equals the size of a cache block.)
Figure 2 Matrix multiplication 1
The cache hit ratio is improved by rearranging matrix B: after the rearrangement, the column elements are laid out contiguously and can be read sequentially from the L1 cache.
Figure 3 Matrix multiplication 2
- Using software prefetch
Software prefetch loads data that will be used later into the cache ahead of time through prefetch memory (PRFM) instructions, avoiding the memory access latency incurred on a cache miss. As shown in Figure 4, the data at addr2 is prefetched in advance, so that after the access to addr1 completes, the data at addr2 is already ready.
The __builtin_prefetch() function can be used in GCC. The function prototype is __builtin_prefetch(const void *addr, int rw, int locality), where addr is the address of the data to prefetch; rw indicates the operation to be performed on the cache line at addr (0, the default, indicates read, while 1 indicates write); and locality indicates how frequently the cache line will be accessed after the prefetch. It can be set to 0, 1, 2, or 3: 0 means the cache line is accessed only once and need not reside in the cache, and 3 means the cache line will be accessed frequently and should reside in caches of all levels. The rw and locality arguments must be compile-time constants.

