Optimizing Cacheline

Principles

The CPU tracks whether cached data is valid in units of cachelines, not in units of the memory bit width. This granularity can cause false sharing, which lowers the CPU cache hit ratio. A common cause of false sharing is that frequently accessed data is not aligned to the cacheline size.

The cache space is divided into cachelines, as shown in Figure 1. When false sharing occurs, readHighFreq is reloaded from memory into the cache even though it has not been rewritten.

Figure 1 Cache space division

For example, the following code defines two variables. Both variables are in the same cacheline, so the cache loads them together.

int readHighFreq, writeHighFreq;

readHighFreq is read frequently, and writeHighFreq is written frequently. After a CPU core rewrites writeHighFreq, the entire cacheline that holds it is marked invalid in the caches of the other cores, so those cores treat readHighFreq as invalid data as well. Although readHighFreq has not been modified, the next access to it must reload the data from memory. This is false sharing, and it degrades performance.
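Whether two variables share a cacheline can be checked with a small sketch such as the following. The struct name hot_fields, the helper functions, and the 128-byte CACHE_LINE_SIZE value are illustrative assumptions and not part of the original code. The sketch also assumes that the struct starts on a cacheline boundary.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 128-byte cacheline, the Kunpeng 920 value given in this topic. */
#define CACHE_LINE_SIZE 128

/* Hypothetical struct holding the two variables from the example. */
struct hot_fields {
    int readHighFreq;   /* read frequently */
    int writeHighFreq;  /* written frequently */
};

/*
 * Returns 1 if two byte offsets fall into the same cacheline, else 0.
 * Offsets are measured from a base address assumed to be cacheline-aligned.
 */
static int same_cacheline(size_t off_a, size_t off_b)
{
    return (off_a / CACHE_LINE_SIZE) == (off_b / CACHE_LINE_SIZE);
}

/* Returns 1 if the two fields of struct hot_fields share a cacheline. */
static int hot_fields_share_cacheline(void)
{
    return same_cacheline(offsetof(struct hot_fields, readHighFreq),
                          offsetof(struct hot_fields, writeHighFreq));
}
```

Under these assumptions, hot_fields_share_cacheline() reports that both fields sit in one cacheline, which is the layout that makes a write to writeHighFreq invalidate readHighFreq as well.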

The cacheline size of Kunpeng 920 servers differs from that of x86 servers: the cacheline size of the x86 L3 cache is 64 bytes, and that of the Kunpeng 920 is 128 bytes. Programs optimized for x86 servers may therefore perform poorly on Kunpeng 920 servers. In this case, modify the memory alignment size in the service code.

Modification Method

Modify the service code so that frequently read and written data is aligned to the cacheline size, as follows:
1. Align dynamically allocated memory as follows:
  int posix_memalign(void **memptr, size_t alignment, size_t size);
  When posix_memalign succeeds, it returns 0 and stores in *memptr the address of a dynamically allocated block of size bytes whose start address is a multiple of alignment. The alignment value must be a power of two and a multiple of sizeof(void *).
2. Pad local variables as follows:
  int writeHighFreq;
  char pad[CACHE_LINE_SIZE - sizeof(int)];
  In the code, CACHE_LINE_SIZE is the cacheline size of the server. The pad variable fills the space that remains after writeHighFreq, so that the two variables together occupy exactly one cacheline.
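The two methods above can be combined as in the following minimal sketch. The struct name padded_counter, the helper alloc_counters, and the 128-byte CACHE_LINE_SIZE value are hypothetical names and values chosen for illustration. Only posix_memalign and free are standard calls.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* makes posix_memalign visible in strict modes */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumption: 128-byte cacheline, the Kunpeng 920 value given in this topic. */
#define CACHE_LINE_SIZE 128

/* writeHighFreq padded so that each element occupies one full cacheline. */
struct padded_counter {
    int writeHighFreq;
    char pad[CACHE_LINE_SIZE - sizeof(int)];
};

/*
 * Allocates count padded counters whose start address is a multiple of
 * CACHE_LINE_SIZE. Returns NULL on failure. Release the block with free().
 */
static struct padded_counter *alloc_counters(size_t count)
{
    void *mem = NULL;

    /* posix_memalign returns 0 on success and a nonzero error code otherwise. */
    if (posix_memalign(&mem, CACHE_LINE_SIZE,
                       count * sizeof(struct padded_counter)) != 0) {
        return NULL;
    }
    return mem;
}
```

Because each element is exactly CACHE_LINE_SIZE bytes and the array starts on a cacheline boundary, every element in the array sits in its own cacheline, so a write to one counter does not invalidate its neighbors.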
Some open-source software already defines a cacheline macro, in which case you only need to change the macro value. For example, Impala uses the CACHE_LINE_SIZE macro to indicate the cacheline size of the target platform.
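A project that has to build for both platforms could select the macro value at compile time, as in the following sketch. This is not how Impala defines its macro. The use of the compiler-defined __aarch64__ symbol as a stand-in for Kunpeng 920 is an assumption, and so is the struct name padded_int. Other ARM64 processors may use a different cacheline size.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Assumption: an AArch64 build targets Kunpeng 920 (128 bytes);
 * every other build targets x86 (64 bytes). A build system can
 * override the value with -DCACHE_LINE_SIZE=<n>.
 */
#ifndef CACHE_LINE_SIZE
#  if defined(__aarch64__)
#    define CACHE_LINE_SIZE 128
#  else
#    define CACHE_LINE_SIZE 64
#  endif
#endif

/* Hypothetical padded variable that follows the macro automatically. */
struct padded_int {
    int writeHighFreq;
    char pad[CACHE_LINE_SIZE - sizeof(int)];
};

/* Fails the build if the padding no longer fills exactly one cacheline. */
_Static_assert(sizeof(struct padded_int) == CACHE_LINE_SIZE,
               "padded_int must occupy exactly one cacheline");
```

With a single macro as the source of truth, porting from x86 to Kunpeng 920 changes one value and the padding in every dependent struct follows.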
