Cacheline

A CPU does not read memory one byte at a time. It transfers data between memory and the cache in fixed-size blocks called cachelines, and it tracks whether cached data is valid per cacheline rather than per byte or per word. This granularity can cause false sharing, which reduces the CPU cache hit ratio. A common cause of false sharing is that frequently accessed data is not aligned to the cacheline size.

The cache space is divided into cachelines, as shown in Figure 1. When false sharing occurs, readHighFreq is read from memory again even though it has not been modified and is already in the cache.

Figure 1 Cache space

For example, the following code defines two variables. The two variables lie in the same cacheline, so the cache loads them together.

int readHighFreq, writeHighFreq;

readHighFreq is read frequently, and writeHighFreq is written frequently. When one CPU core modifies writeHighFreq, the copies of that entire cacheline held by other cores are marked invalid, so those cores treat readHighFreq as invalid data as well. Although readHighFreq has not been modified, the next access to it from such a core must load the data from memory again. This is false sharing, and it degrades performance.
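To make the condition above concrete, the following is a minimal sketch of how one might check whether two addresses fall in the same cacheline. The helper names same_cacheline and counters_share_line and the struct counters type are hypothetical and introduced only for illustration. The sketch assumes sizeof(int) is 4 and uses C11 _Alignas so that the 8-byte struct cannot straddle a line boundary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if addresses a and b start in the same
 * cacheline, for cachelines of line_size bytes. */
static bool same_cacheline(const void *a, const void *b, size_t line_size)
{
    assert(line_size != 0);
    return ((uintptr_t)a / line_size) == ((uintptr_t)b / line_size);
}

/* Two adjacent int members, like readHighFreq and writeHighFreq above. */
struct counters {
    int readHighFreq;
    int writeHighFreq;
};

/* Assumption: sizeof(int) == 4, so the struct is 8 bytes. With 8-byte
 * alignment it always sits inside a single 64- or 128-byte cacheline. */
static _Alignas(8) struct counters shared;

/* Hypothetical helper: true if both members share one cacheline. */
static bool counters_share_line(size_t line_size)
{
    return same_cacheline(&shared.readHighFreq, &shared.writeHighFreq,
                          line_size);
}
```

Calling counters_share_line(64) or counters_share_line(128) reports that the two members share a line, which is exactly the layout in which a write to one member invalidates the other.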

The cacheline size of the Kunpeng 920 processor differs from that of x86 processors: the L3 cacheline is 64 bytes on x86 and 128 bytes on Kunpeng 920. As a result, a program that was optimized for x86 may perform poorly on Kunpeng 920. In this case, you need to modify the memory alignment size used in the service code.
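One possible way to keep the alignment size in a single place is a compile-time constant selected per architecture. The sketch below is an illustration only: the macro name CACHE_LINE_SIZE matches the name used later in this topic, the helper cache_line_size is hypothetical, and mapping every AArch64 target to 128 bytes is an assumption that holds for Kunpeng 920 as described here but not necessarily for other Arm processors.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: AArch64 builds target Kunpeng 920 (128-byte cacheline);
 * all other builds are treated as x86 (64-byte cacheline). */
#if defined(__aarch64__)
#define CACHE_LINE_SIZE 128
#else
#define CACHE_LINE_SIZE 64
#endif

/* Alignment values must be powers of two. */
static_assert((CACHE_LINE_SIZE & (CACHE_LINE_SIZE - 1)) == 0,
              "CACHE_LINE_SIZE must be a power of two");

/* Hypothetical helper: returns the alignment size chosen at compile time. */
static size_t cache_line_size(void)
{
    return CACHE_LINE_SIZE;
}
```

With this approach, porting from x86 to Kunpeng 920 changes one definition instead of every hard-coded 64 in the service code.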

Cacheline Alignment Programming

Data that is frequently read and written should be aligned to the cacheline size. There are two ways to do this: aligned dynamic memory allocation and padding.

Dynamically allocated memory can be aligned with the following function:

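int posix_memalign(void **memptr, size_t alignment, size_t size);

On success, posix_memalign returns 0 and stores in *memptr the address of a dynamically allocated block of size bytes whose start address is a multiple of alignment. The alignment value must be a power of two and a multiple of sizeof(void *). On failure, the function returns a nonzero error number.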
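The following is a minimal usage sketch of posix_memalign for cacheline-aligned allocation. The helper name alloc_cacheline_aligned is hypothetical, and the value 128 for CACHE_LINE_SIZE is an assumption taken from the Kunpeng 920 figure given in this topic.

```c
#define _POSIX_C_SOURCE 200112L  /* exposes posix_memalign() in <stdlib.h> */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CACHE_LINE_SIZE 128  /* assumption: Kunpeng 920 cacheline size */

/* Hypothetical helper: returns a zeroed buffer of size bytes whose start
 * address is a multiple of CACHE_LINE_SIZE, or NULL on failure.
 * Release the buffer with free(). */
static void *alloc_cacheline_aligned(size_t size)
{
    void *p = NULL;

    /* posix_memalign() returns 0 on success and an error number otherwise. */
    if (posix_memalign(&p, CACHE_LINE_SIZE, size) != 0)
        return NULL;

    assert((uintptr_t)p % CACHE_LINE_SIZE == 0);
    memset(p, 0, size);
    return p;
}
```

A caller would request, for example, alloc_cacheline_aligned(256) for two cachelines of data and pass the returned pointer to free() when done.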

Local variables can be padded as follows:

int writeHighFreq;
char pad[CACHE_LINE_SIZE - sizeof(int)];

In the code, CACHE_LINE_SIZE is the cacheline size of the server. The pad variable fills the rest of the cacheline after writeHighFreq, so the two variables together occupy exactly one cacheline.
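As a sketch of the same padding idea applied to a structure, the example below wraps the counter and its padding in one type and checks the layout at compile time. The type name padded_counter is hypothetical, and CACHE_LINE_SIZE is assumed to be 128. Padding fixes only the size of the object. The C11 _Alignas specifier used here is an addition that is not part of the original example; it makes the start address fall on a cacheline boundary as well.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 128  /* assumption: Kunpeng 920 cacheline size */

/* Hypothetical type: one frequently written counter that owns a whole
 * cacheline. */
struct padded_counter {
    _Alignas(CACHE_LINE_SIZE) int writeHighFreq;  /* starts on a line boundary */
    char pad[CACHE_LINE_SIZE - sizeof(int)];      /* fills the rest of the line */
};

/* Compile-time checks: the counter plus padding is exactly one cacheline,
 * and every object of this type is cacheline aligned. */
static_assert(sizeof(struct padded_counter) == CACHE_LINE_SIZE,
              "padded_counter must occupy exactly one cacheline");
static_assert(_Alignof(struct padded_counter) == CACHE_LINE_SIZE,
              "padded_counter must be cacheline aligned");
```

In an array of struct padded_counter, consecutive elements are CACHE_LINE_SIZE bytes apart, so threads that update different elements do not write to the same cacheline.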

The MySQL source code aligns many data structures to 64-byte cachelines for the x86 platform. Because the L3 cacheline of the Kunpeng 920 processor is 128 bytes, the alignment in the MySQL source code needs to be changed to 128 bytes. After the following MySQL data structures are aligned to 128 bytes, the TPM (transactions per minute) improves by 3% to 4%:

btr_search_latches
btr_search_sys
ReadView::m_view_list
trx_sys_t::rw_trx_list
trx_sys_t::mysql_trx_list
trx_sys_t::rsegs
srv_conc_t::n_active
lock_sys_t::mutex
lock_sys_t::wait_mutex

Cacheline Alignment Implementation

Code before cacheline alignment:

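#define _GNU_SOURCE   /* for cpu_set_t, the CPU_* macros, and sched_setaffinity() */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define TIME_S 99999999
#define NUM_THREADS 4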

struct foo {
    int x;
    int y;
};

static struct foo f;
static struct foo testf;
/* The following four functions run concurrently, each pinned to a different CPU core: */

void *inc_b1(void *arg)
{
    (void)arg;
    cpu_set_t mask;
    cpu_set_t get;
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);
    /* Pin the calling thread to CPU 0 (pid 0 means the caller). */
    if (sched_setaffinity(0, sizeof(mask), &mask) == -1) {
        printf("warning: could not set CPU affinity, continuing...\n");
    }
    CPU_ZERO(&get);
    if (sched_getaffinity(0, sizeof(get), &get) == -1) {
        printf("warning: could not get thread affinity, continuing...\n");
    }
    if (CPU_ISSET(0, &get)) {
        printf("inc_b1 is running on CPU 0\n");
    }
    /* Writer: repeatedly modifies testf.y. */
    for (int i = 0; i < TIME_S; ++i)
        ++testf.y;
    return NULL;
}

void *sum_a1(void *arg)
{
    (void)arg;
    cpu_set_t mask;
    cpu_set_t get;
    CPU_ZERO(&mask);
    CPU_SET(1, &mask);
    /* Pin the calling thread to CPU 1 (pid 0 means the caller). */
    if (sched_setaffinity(0, sizeof(mask), &mask) == -1) {
        printf("warning: could not set CPU affinity, continuing...\n");
    }
    CPU_ZERO(&get);
    if (sched_getaffinity(0, sizeof(get), &get) == -1) {
        printf("warning: could not get thread affinity, continuing...\n");
    }
    if (CPU_ISSET(1, &get)) {
        printf("sum_a1 is running on CPU 1\n");
    }
    /* Reader: repeatedly reads testf.x, which shares a cacheline with testf.y. */
    int s = 0;
    for (int i = 0; i < TIME_S; ++i)
        s += testf.x;
    /* Return the sum through the thread's void * result. */
    return (void *)(long)s;
}

void *sum_a(void *arg)
{
    (void)arg;
    cpu_set_t mask;
    cpu_set_t get;
    CPU_ZERO(&mask);
    CPU_SET(2, &mask);
    /* Pin the calling thread to CPU 2 (pid 0 means the caller). */
    if (sched_setaffinity(0, sizeof(mask), &mask) == -1) {
        printf("warning: could not set CPU affinity, continuing...\n");
    }
    CPU_ZERO(&get);
    if (sched_getaffinity(0, sizeof(get), &get) == -1) {
        printf("warning: could not get thread affinity, continuing...\n");
    }
    if (CPU_ISSET(2, &get)) {
        printf("sum_a is running on CPU 2\n");
    }
    /* Reader: repeatedly reads f.x, which shares a cacheline with f.y. */
    int s = 0;
    for (int i = 0; i < TIME_S; ++i)
        s += f.x;
    /* Return the sum through the thread's void * result. */
    return (void *)(long)s;
}

void *inc_b(void *arg)
{
    (void)arg;
    cpu_set_t mask;
    cpu_set_t get;
    CPU_ZERO(&mask);
    CPU_SET(3, &mask);
    /* Pin the calling thread to CPU 3 (pid 0 means the caller). */
    if (sched_setaffinity(0, sizeof(mask), &mask) == -1) {
        printf("warning: could not set CPU affinity, continuing...\n");
    }
    CPU_ZERO(&get);
    if (sched_getaffinity(0, sizeof(get), &get) == -1) {
        printf("warning: could not get thread affinity, continuing...\n");
    }
    if (CPU_ISSET(3, &get)) {
        printf("inc_b is running on CPU 3\n");
    }
    /* Writer: repeatedly modifies f.y. */
    for (int i = 0; i < TIME_S; ++i)
        ++f.y;
    return NULL;
}

int main(void)
{
    int ret = 0;
    pthread_t tids[NUM_THREADS];
    printf("start the threads\n");
    ret = pthread_create(&tids[0], NULL, sum_a, NULL);
    if (ret != 0) {
        printf("pthread_create error: error_code = %d\n", ret);
        return 1;
    }
    ret = pthread_create(&tids[1], NULL, inc_b, NULL);
    if (ret != 0) {
        printf("pthread_create error: error_code = %d\n", ret);
        return 1;
    }
    ret = pthread_create(&tids[2], NULL, sum_a1, NULL);
    if (ret != 0) {
        printf("pthread_create error: error_code = %d\n", ret);
        return 1;
    }
    ret = pthread_create(&tids[3], NULL, inc_b1, NULL);
    if (ret != 0) {
        printf("pthread_create error: error_code = %d\n", ret);
        return 1;
    }
    /* Wait for all four threads, not only the first one. */
    for (int i = 0; i < NUM_THREADS; ++i)
        pthread_join(tids[i], NULL);
    return 0;
}

The execution time is 2.955s.

New code based on the cacheline alignment principle:

The foo structure is adjusted so that the x and y member variables are placed in two different cachelines, which avoids false sharing. The execution time is shortened to 2.248s.

struct foo {
    int x;
    char padx[124];
    int y;
    char pady[124];
};
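The layout of the padded structure can be checked at compile time. The sketch below repeats the structure and verifies that x and y are at least one cacheline apart, which means they can never share a cacheline regardless of where the object starts. The helper name foo_xy_distance is hypothetical, CACHE_LINE_SIZE is assumed to be 128, and the check assumes sizeof(int) is 4.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 128  /* assumption: Kunpeng 920 cacheline size */

struct foo {
    int x;
    char padx[124];
    int y;
    char pady[124];
};

/* Two members that are at least CACHE_LINE_SIZE bytes apart cannot lie in
 * the same cacheline. Assumes sizeof(int) == 4, so 4 + 124 == 128. */
static_assert(offsetof(struct foo, y) - offsetof(struct foo, x)
                  >= CACHE_LINE_SIZE,
              "x and y must not share a cacheline");

/* Hypothetical helper: distance in bytes between the two hot members. */
static size_t foo_xy_distance(void)
{
    return offsetof(struct foo, y) - offsetof(struct foo, x);
}
```

If the structure is later ported to a platform with a different cacheline size, the static assertion fails at build time when the hard-coded 124-byte padding is no longer sufficient.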
