CPU Cache

The CPU reads data from memory into registers and then performs computation. As computers evolved, the gap between memory access speed and register access speed kept widening, so programs incurred high overheads when reading data from memory, hurting the overall performance of the computer. To bridge this gap, a CPU cache is placed between the memory and the registers. Because reading data from the CPU cache into a register is much faster than reading it directly from memory, adding the CPU cache effectively improves overall performance. Currently, the CPU cache has three levels: L1, L2, and L3. Figure 1 shows the storage hierarchy.

Figure 1 CPU cache

L1 is the cache level closest to the registers. It has the fastest access speed among the three levels but the smallest capacity. Unlike L2 and L3, L1 is split into an instruction cache and a data cache: the instruction cache stores only instructions, the data cache stores only data, and the two are never mixed. This design reflects the fact that L1 is closest to the registers and has the smallest capacity. Without the split, instructions loaded into L1 could be evicted by the data being operated on, after which the instructions would have to be fetched again from L2, which is undesirable.

L2 is slower than L1 but has a larger capacity, and it is not split into separate instruction and data caches. Note that each CPU core has its own private L1 and L2 caches.

L3 has the largest capacity but also the slowest access speed among the three layers. Different from L1 and L2, L3 is shared by multiple CPU cores.

When the CPU needs data for an operation, it first looks in L1. If the data is not found there, the CPU searches L2 and then L3 in sequence. If the data is in none of the caches, a situation called a cache miss, it must be read from memory.

Both Kunpeng and x86 servers use this three-level cache, although the size and access speed of each level differ between them. For program development, keeping the working data accessible in L1 improves performance. A cache-friendly array access order, structure member alignment, and structure layout optimization are effective ways to improve program performance based on how the CPU cache works.

Array Access Sequence

Arrays occupy contiguous memory. If an array is accessed in the same order in which it is stored in memory, the cache hit ratio rises and performance improves. The following example traverses two-dimensional arrays. In the first loop, array B is read by column (arrayB[j][i]): consecutive column elements are far apart in memory, so because of the limited size of L1, after one element has been fetched and used, the next column element is usually not in L1, and the CPU must read it from L2, L3, or even memory, giving relatively poor performance. After the access order is adjusted so that the row elements of arrays A and B are added, data can be read from the cache sequentially. In one measurement, the run time dropped from 544,939 μs to 235,722 μs after the rearrangement.

Read by column:

for (int i = 0; i < ARRAYLEN; i++) {
    for (int j = 0; j < ARRAYLEN; j++) {
        arrayC[i][j] = arrayA[i][j] + arrayB[j][i];
    }
}

Read by row:

for (int i = 0; i < ARRAYLEN; i++) {
    for (int j = 0; j < ARRAYLEN; j++) {
        arrayC[i][j] = arrayA[i][j] + arrayB[i][j];
    }
}

Structure Member Alignment

The alignment rule for structure members is as follows: the first member is placed at offset 0, and each subsequent member starts at an offset that is an integer multiple of its own size (more precisely, of its own alignment requirement). In addition, the total size of the structure, that is, the result of sizeof, must be an integer multiple of the size of the largest member; if it falls short, padding bytes are appended. See the following example.

struct Test1 {
    char a;
    double b;
    int c;
    short d;
} Test1;

According to this rule, the memory layout of Test1 is shown in Figure 2. The structure occupies bytes 0 to 23, of which bytes 1 to 7, 22, and 23 are padding.

Figure 2 Structure memory distribution (before adjustment)

Consider another structure, Test2, which provides the same functionality as Test1 but declares its member variables in a different order.

struct Test2 {
    double b;
    int c;
    short d;
    char a;
} Test2;

According to the same alignment rule, the memory layout of Test2 is shown in Figure 3. The structure occupies bytes 0 to 15, and only byte 15 is padding.

Figure 3 Structure memory distribution (after adjustment)

In this example, Test2 implements the same functionality as Test1 while using one third less memory. The greater compactness improves memory utilization, makes L1 cache hits more likely, and improves performance.

Structure Layout Design

Different structure layouts also yield different performance. The following describes two data organization modes: the structure array (array of structures, AoS) and the array structure (structure of arrays, SoA).

Structure array:

struct Array {
    int x;
    int y;
};

struct Array stArray[N];

In this case, each pair of x and y is stored contiguously, so the memory layout is x0, y0, x1, y1, ..., x(N-1), y(N-1).

The array structure is defined as follows:

Array structure:

struct Array {
    int x[N];
    int y[N];
};

struct Array stArray;

In this case, all x values are stored together, followed by all y values, so the memory layout is x0, x1, ..., x(N-1), y0, y1, ..., y(N-1).

When a workload mainly operates on x, the array structure loads x into memory and cache as contiguous data, unlike the structure array, which improves cache line utilization and the cache hit ratio, and therefore performance. You are therefore advised to choose a structure layout that suits the actual access pattern when programming.