CacheLine

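A CPU does not read data one byte at a time; it reads in units of a cache line (CacheLine). Likewise, the CPU marks cached data as valid or invalid per cache line, not per memory bus width. This mechanism can cause false sharing, which lowers the CPU cache hit ratio. A common cause of false sharing is that frequently accessed data is not aligned to the cache line size.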

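The cache is divided into cache lines, as shown in Figure 1. When false sharing occurs, readHighFreq is read from memory again even though it has not been modified and is still present in the cache.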

Figure 1 Cache space diagram

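For example, the following code defines two variables that will typically sit in the same cache line, so the cache loads both of them together: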

int readHighFreq, writeHighFreq;

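Here readHighFreq is read frequently and writeHighFreq is written frequently. After writeHighFreq is modified on one CPU core, the entire cache line that holds it is marked invalid in the caches of the other cores. readHighFreq is therefore marked invalid as well, even though it has not been modified, and a CPU that accesses readHighFreq still has to reload it from memory. This false sharing degrades performance.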
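As a rough illustration of the layout described above, the sketch below checks whether two addresses fall on the same cache line by comparing their line indices. The helper `same_cache_line`, the wrapper `struct hot_pair`, and the 128-byte line size are assumptions made for this example only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed line size for this sketch (the Kunpeng 920 value quoted in this section). */
#define LINE_SIZE 128

/* Hypothetical helper: returns 1 if a and b lie on the same cache line. */
static int same_cache_line(const void *a, const void *b, size_t line_size)
{
    assert(line_size != 0);
    return ((uintptr_t)a / line_size) == ((uintptr_t)b / line_size);
}

/* The two adjacent ints from the text, wrapped in a struct so that their
 * relative placement is defined. _Alignas (C11) makes the struct start on
 * a line boundary, so both members are guaranteed to share one line. */
struct hot_pair {
    _Alignas(LINE_SIZE) int readHighFreq;
    int writeHighFreq;
};

static struct hot_pair g_pair;
```

Calling `same_cache_line(&g_pair.readHighFreq, &g_pair.writeHighFreq, LINE_SIZE)` returns 1 here, which is exactly the situation in which a write to one member invalidates the other.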

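The Kunpeng 920 processor and x86 processors use different cache line sizes, so a program that was tuned on x86 may perform worse on the Kunpeng 920 processor until the memory alignment of its data is adjusted in the service code. The L3 cache line is 64 bytes on x86 and 128 bytes on the Kunpeng 920 processor.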
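Because the line size differs between platforms, it can help to query it rather than hard-code it. The following is a minimal sketch that assumes Linux with glibc: the `_SC_LEVEL*_LINESIZE` names are glibc extensions to `sysconf`, not POSIX, and they may report 0 or -1 when the value is unknown. The helper name `max_cache_line_size` and the fallback parameter are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: largest cache line size that sysconf() reports for
 * the L1 data, L2 and L3 caches, or `fallback` if nothing is reported. */
static long max_cache_line_size(long fallback)
{
    long best = 0;
    assert(fallback > 0);
#ifdef _SC_LEVEL1_DCACHE_LINESIZE   /* glibc extension, absent on other libcs */
    const int names[] = {
        _SC_LEVEL1_DCACHE_LINESIZE,
        _SC_LEVEL2_CACHE_LINESIZE,
        _SC_LEVEL3_CACHE_LINESIZE,
    };
    for (size_t i = 0; i < sizeof(names) / sizeof(names[0]); ++i) {
        long v = sysconf(names[i]);   /* 0 or -1 means "unknown" */
        if (v > best)
            best = v;
    }
#endif
    return best > 0 ? best : fallback;
}
```

A caller could use `max_cache_line_size(128)` to pick a conservative alignment at startup. Alignment that must be known at compile time (struct padding, for instance) still needs a constant.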

CacheLine Alignment Programming Methods

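Data that is read or written frequently should be aligned to the cache line size. There are two ways to do this: dynamically allocating aligned memory, and padding.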

Alignment through dynamic memory allocation:

int posix_memalign(void **memptr, size_t alignment, size_t size);

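On success, posix_memalign returns 0 and stores in *memptr the address of size bytes of dynamically allocated memory whose start address is a multiple of alignment. The alignment argument must be a power of two and a multiple of sizeof(void *).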
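A minimal usage sketch of posix_memalign for cache-line-aligned allocation follows. The wrapper name `alloc_cache_aligned` and the `CACHE_LINE_SIZE` value of 128 are assumptions for illustration; the zero-fill is a convenience of this sketch, not something posix_memalign does itself.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* exposes posix_memalign under strict -std modes */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CACHE_LINE_SIZE 128   /* assumed: the Kunpeng 920 value quoted in this section */

/* Hypothetical wrapper: allocate `size` zeroed bytes that start on a cache
 * line boundary. Returns NULL on failure; release the memory with free(). */
static void *alloc_cache_aligned(size_t size)
{
    void *p = NULL;
    /* posix_memalign returns 0 on success and an error number otherwise;
     * it reports errors through the return value, not through errno. */
    int rc = posix_memalign(&p, CACHE_LINE_SIZE, size);
    if (rc != 0)
        return NULL;
    assert((uintptr_t)p % CACHE_LINE_SIZE == 0);
    memset(p, 0, size);
    return p;
}
```

Rounding `size` up to a multiple of the line size is a reasonable extra precaution when the tail of the buffer must not share a line with a neighbouring allocation.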

Local variables can be padded instead:

int writeHighFreq;
char pad[CACHE_LINE_SIZE - sizeof(int)];

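In this code, CACHE_LINE_SIZE is the cache line size of the server. The pad array holds no data; it only fills the space left after writeHighFreq so that the two together occupy exactly one cache line.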
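The padding fragment can be sketched as a complete type. Note that padding fixes only the size of the object, not its start address; the C11 `_Alignas` specifier constrains both. The struct names `padded_counter` and `aligned_counter` and the value 128 are assumptions for this example.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 128   /* assumed value for this sketch */

/* Padding variant from the text: the struct is exactly one line long,
 * but its start address keeps the natural alignment of int. */
struct padded_counter {
    int  writeHighFreq;
    char pad[CACHE_LINE_SIZE - sizeof(int)];
};

/* C11 variant: the compiler aligns the start address to the line size
 * and rounds the size up to a whole line without an explicit pad array. */
struct aligned_counter {
    _Alignas(CACHE_LINE_SIZE) int writeHighFreq;
};

/* Compile-time checks on both layouts. */
static_assert(sizeof(struct padded_counter) == CACHE_LINE_SIZE,
              "padded_counter must fill one cache line");
static_assert(sizeof(struct aligned_counter) == CACHE_LINE_SIZE,
              "aligned_counter must fill one cache line");
static_assert(_Alignof(struct aligned_counter) == CACHE_LINE_SIZE,
              "aligned_counter must start on a line boundary");
```

The `static_assert` lines turn a wrong `CACHE_LINE_SIZE` or an unexpected `sizeof(int)` into a build error instead of a silent performance regression.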

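MySQL aligns many data structures to the 64-byte cache line of the x86 platform. Because the L3 cache line of the Kunpeng 920 processor is 128 bytes, the alignment in the MySQL source code needs to be changed to 128 bytes. Changing the cache line alignment of the following MySQL data structures to 128 bytes improves TPM by 3% to 4%.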

btr_search_latches
btr_search_sys
ReadView::m_view_list
trx_sys_t::rw_trx_list
trx_sys_t::mysql_trx_list
trx_sys_t::rsegs
srv_conc_t::n_active
lock_sys_t::mutex
lock_sys_t::wait_mutex
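A change of this kind usually comes down to a single alignment constant. The following is a generic C sketch, not MySQL source code: the macro `HOT_DATA_ALIGN`, the types `hot_list` and `hot_globals`, and their fields are all hypothetical, and choosing 128 bytes whenever `__aarch64__` is defined is an assumption that suits Kunpeng 920 but not necessarily every AArch64 processor.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-architecture alignment constant. */
#if defined(__aarch64__)
#define HOT_DATA_ALIGN 128   /* assumed: Kunpeng 920 L3 cache line */
#else
#define HOT_DATA_ALIGN 64    /* assumed: x86 cache line */
#endif

/* A frequently updated list header that gets a cache line to itself. */
struct hot_list {
    _Alignas(HOT_DATA_ALIGN) void *head;
    size_t count;
};

/* Two hot lists side by side: each one starts on its own line, so updates
 * to one do not invalidate the line that holds the other. */
struct hot_globals {
    struct hot_list rw_list;
    struct hot_list view_list;
};

static_assert(sizeof(struct hot_list) % HOT_DATA_ALIGN == 0,
              "hot_list must cover whole cache lines");
static_assert(offsetof(struct hot_globals, view_list) % HOT_DATA_ALIGN == 0,
              "view_list must start on a line boundary");
```

With the constant in one place, moving from 64-byte to 128-byte alignment is a one-line change that the static assertions verify at build time.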

CacheLine Alignment Example

Code that does not consider cache line alignment:

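#define _GNU_SOURCE   /* for cpu_set_t, CPU_SET and sched_setaffinity */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define TIME_S 99999999
#define NUM_THREADS 4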

struct foo {
    int x;
    int y;
};

static struct foo f;
static struct foo testf;
/* The following four functions run concurrently, each pinned to its own CPU core: */

/* Writer pinned to CPU 0: increments testf.y in a tight loop. */
void *inc_b1(void *arg)
{
    (void)arg;  /* unused */
    cpu_set_t mask;
    cpu_set_t get;
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);
    if(sched_setaffinity(0, sizeof(mask), &mask) == -1){
        printf("warning: could not set CPU affinity, continuing...\n");
    }
    CPU_ZERO(&get);
    if(sched_getaffinity(0, sizeof(get), &get) == -1){
        printf("warning: could not get thread affinity, continuing...\n");
    }
    if(CPU_ISSET(0, &get)){
        printf("inc_b1 is running on CPU %d\n", 0);
    }
    for (int i = 0; i < TIME_S; ++i)
        ++testf.y;
    return NULL;
}

/* Reader pinned to CPU 1: sums testf.x in a tight loop. */
void *sum_a1(void *arg)
{
    (void)arg;  /* unused */
    cpu_set_t mask;
    cpu_set_t get;
    CPU_ZERO(&mask);
    CPU_SET(1, &mask);
    if(sched_setaffinity(0, sizeof(mask), &mask) == -1){
        printf("warning: could not set CPU affinity, continuing...\n");
    }
    CPU_ZERO(&get);
    if(sched_getaffinity(0, sizeof(get), &get) == -1){
        printf("warning: could not get thread affinity, continuing...\n");
    }
    if(CPU_ISSET(1, &get)){
        printf("sum_a1 is running on CPU %d\n", 1);
    }
    int s = 0;
    for (int i = 0; i < TIME_S; ++i)
        s += testf.x;
    return (void *)(long)s;  /* pass the sum back as the thread's exit value */
}

/* Reader pinned to CPU 2: sums f.x in a tight loop. */
void *sum_a(void *arg)
{
    (void)arg;  /* unused */
    cpu_set_t mask;
    cpu_set_t get;
    CPU_ZERO(&mask);
    CPU_SET(2, &mask);
    if(sched_setaffinity(0, sizeof(mask), &mask) == -1){
        printf("warning: could not set CPU affinity, continuing...\n");
    }
    CPU_ZERO(&get);
    if(sched_getaffinity(0, sizeof(get), &get) == -1){
        printf("warning: could not get thread affinity, continuing...\n");
    }
    if(CPU_ISSET(2, &get)){
        printf("sum_a is running on CPU %d\n", 2);
    }
    int s = 0;
    for (int i = 0; i < TIME_S; ++i)
        s += f.x;
    return (void *)(long)s;  /* pass the sum back as the thread's exit value */
}

/* Writer pinned to CPU 3: increments f.y in a tight loop. */
void *inc_b(void *arg)
{
    (void)arg;  /* unused */
    cpu_set_t mask;
    cpu_set_t get;
    CPU_ZERO(&mask);
    CPU_SET(3, &mask);
    if(sched_setaffinity(0, sizeof(mask), &mask) == -1){
        printf("warning: could not set CPU affinity, continuing...\n");
    }
    CPU_ZERO(&get);
    if(sched_getaffinity(0, sizeof(get), &get) == -1){
        printf("warning: could not get thread affinity, continuing...\n");
    }
    if(CPU_ISSET(3, &get)){
        printf("inc_b is running on CPU %d\n", 3);
    }
    for (int i = 0; i < TIME_S; ++i)
        ++f.y;
    return NULL;
}
int main(){
    int ret = 0;
    pthread_t tids[NUM_THREADS];
    printf("start the threads\n");
    ret = pthread_create(&tids[0], NULL, sum_a, NULL);
    if(ret != 0){
        printf("pthread_create error: error_code = %d\n", ret);
    }
    ret = pthread_create(&tids[1], NULL, inc_b, NULL);
    if(ret != 0){
        printf("pthread_create error: error_code = %d\n", ret);
    }

    ret = pthread_create(&tids[2], NULL, sum_a1, NULL);
    if(ret != 0){
        printf("pthread_create error: error_code = %d\n", ret);
    }
    ret = pthread_create(&tids[3], NULL, inc_b1, NULL);
    if(ret != 0){
        printf("pthread_create error: error_code = %d\n", ret);
    }
    /* Wait for all four threads, not only the first one. */
    for (int i = 0; i < NUM_THREADS; ++i)
        pthread_join(tids[i], NULL);
    return 0;
}

The execution time is 2.955s.
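The text does not say how the execution time was measured. One possible approach, sketched here under that caveat, is to sample a monotonic clock before the threads are created and after the last one is joined. The helper names `now_monotonic` and `elapsed_seconds` are hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* exposes clock_gettime under strict -std modes */
#endif
#include <assert.h>
#include <time.h>

/* Hypothetical helper: seconds elapsed between two timespec samples. */
static double elapsed_seconds(struct timespec start, struct timespec end)
{
    assert(end.tv_sec >= start.tv_sec);
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Hypothetical helper: sample CLOCK_MONOTONIC, which is not affected by
 * wall-clock adjustments made while the benchmark runs. */
static struct timespec now_monotonic(void)
{
    struct timespec ts = {0, 0};
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;   /* silences the unused warning when NDEBUG is defined */
    return ts;
}
```

In the benchmark, one sample would be taken before the first pthread_create call and one after the final pthread_join, and the difference printed with `elapsed_seconds`.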

Modified according to the cache line alignment principle:

Adjust the foo struct so that members x and y are placed in two different cache lines, which avoids false sharing. The execution time drops to 2.248s.

struct foo {
    int x;
    char padx[124];
    int y;
    char pady[124];
};
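The hard-coded 124 assumes a 4-byte int and a 128-byte cache line. A hedged variant derives the padding from a constant and checks the layout at compile time; `CACHE_LINE_SIZE` and the helper `xy_share_line` are assumptions of this sketch. Because x and y end up a full line apart, they cannot share a line whatever address the struct starts at, even though padding alone does not align the start address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 128   /* assumed: matches the 4 + 124 byte layout above */

struct foo {
    int  x;
    char padx[CACHE_LINE_SIZE - sizeof(int)];
    int  y;
    char pady[CACHE_LINE_SIZE - sizeof(int)];  /* keeps the next object off y's line */
};

static_assert(offsetof(struct foo, y) == CACHE_LINE_SIZE,
              "y must start one full line after x");
static_assert(sizeof(struct foo) == 2 * CACHE_LINE_SIZE,
              "foo must span exactly two lines");

/* Hypothetical helper: would x and y share a cache line if a struct foo
 * were placed at address `base`? Expected to be 0 for every base. */
static int xy_share_line(uintptr_t base)
{
    uintptr_t line_x = (base + offsetof(struct foo, x)) / CACHE_LINE_SIZE;
    uintptr_t line_y = (base + offsetof(struct foo, y)) / CACHE_LINE_SIZE;
    return line_x == line_y;
}
```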
