CacheLine
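A CPU does not read data one byte at a time; it reads in units of one CacheLine. Likewise, the CPU marks data in the cache as valid or invalid per CacheLine, not per memory bus width. This mechanism can cause false sharing, which lowers the CPU cache hit rate. A common cause of false sharing is that frequently accessed data is not aligned to the CacheLine size.
The cache is divided into CacheLines, as shown in Figure 1. When false sharing occurs, readHighFreq is read from memory again even though it has not been modified and is still present in the cache.
For example, the following code defines two variables that end up in the same CacheLine, so the cache loads both together:
int readHighFreq, writeHighFreq;
Here readHighFreq is a frequently read variable and writeHighFreq is a frequently written variable. When one CPU core modifies writeHighFreq, the whole CacheLine that holds it is marked invalid in the caches of the other cores. readHighFreq is therefore marked invalid as well, even though it was not modified. The next access to readHighFreq has to reload the line from memory, and this false sharing reduces performance.
The Kunpeng 920 processor and x86 processors use different CacheLine sizes: the CacheLine of the x86 L3 cache is 64 bytes, whereas the CacheLine of the Kunpeng 920 processor is 128 bytes. A program tuned on x86 may therefore perform poorly on the Kunpeng 920 processor, and the memory alignment size of the data in the service code needs to be changed.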
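Before changing alignment constants, it can help to check what line size the target machine reports. The following is a minimal sketch that assumes Linux with glibc, where `sysconf` accepts the `_SC_LEVEL1_DCACHE_LINESIZE` and `_SC_LEVEL3_CACHE_LINESIZE` extensions. The function names are hypothetical, and some systems report 0 or -1 for these queries, in which case the sketch returns 0.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Reported L1 data cache line size in bytes, or 0 if unknown. */
long l1d_line_size(void)
{
#ifdef _SC_LEVEL1_DCACHE_LINESIZE   /* glibc extension, not POSIX */
    long size = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    return size > 0 ? size : 0;
#else
    return 0;
#endif
}

/* Reported L3 cache line size in bytes, or 0 if unknown. */
long l3_line_size(void)
{
#ifdef _SC_LEVEL3_CACHE_LINESIZE    /* glibc extension, not POSIX */
    long size = sysconf(_SC_LEVEL3_CACHE_LINESIZE);
    return size > 0 ? size : 0;
#else
    return 0;
#endif
}

/* Print both values so they can be compared with the constants in the code. */
void print_cache_line_sizes(void)
{
    printf("L1d line size: %ld bytes\n", l1d_line_size());
    printf("L3 line size:  %ld bytes\n", l3_line_size());
}
```

Calling `print_cache_line_sizes()` once at startup is enough. A result of 0 means the system did not report a value, so the size has to be taken from the processor documentation instead.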
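CacheLine Alignment Programming Methods
Data that is read or written frequently should be aligned to the CacheLine size. There are two ways to do this: dynamically allocating aligned memory, and padding.
Alignment using dynamic memory allocation:
int posix_memalign(void **memptr, size_t alignment, size_t size);
On success, posix_memalign returns 0 and stores in *memptr the address of size bytes of dynamically allocated memory whose start address is a multiple of alignment. The alignment value must be a power of two and a multiple of sizeof(void *).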
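The call described above could be wrapped as follows. This is a sketch only: the helper name `alloc_line_aligned_ints` is hypothetical, and `CACHE_LINE_SIZE` is assumed to be 128, the Kunpeng 920 value given in the text.

```c
#define _POSIX_C_SOURCE 200112L   /* exposes posix_memalign under strict -std= modes */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE_SIZE 128       /* assumption: Kunpeng 920 CacheLine size */

/*
 * Allocate room for `count` ints starting on a CacheLine boundary.
 * Returns NULL on failure; the caller releases the memory with free().
 */
int *alloc_line_aligned_ints(size_t count)
{
    void *ptr = NULL;

    /* posix_memalign returns 0 on success and an error number otherwise. */
    if (posix_memalign(&ptr, CACHE_LINE_SIZE, count * sizeof(int)) != 0)
        return NULL;

    /* The start address is a multiple of the requested alignment. */
    assert((uintptr_t)ptr % CACHE_LINE_SIZE == 0);
    return ptr;
}
```

Note that posix_memalign reports errors through its return value rather than through errno, which is why the sketch compares the result with 0.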
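Local variables can be padded instead:
int writeHighFreq;
char pad[CACHE_LINE_SIZE - sizeof(int)];
In this code, CACHE_LINE_SIZE is the CacheLine size of the server. The pad variable has no purpose other than filling the space that remains after writeHighFreq, so that the two together occupy exactly one CacheLine.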
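The padding idea can be checked at compile time. The sketch below wraps the two declarations in a struct (the names `padded_counter` and `counter_stride` are hypothetical) and assumes a `CACHE_LINE_SIZE` of 128, as stated for the Kunpeng 920 processor.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 128   /* assumption: Kunpeng 920 CacheLine size */

/* One hot variable plus padding that fills the rest of the line. */
struct padded_counter {
    int  writeHighFreq;
    char pad[CACHE_LINE_SIZE - sizeof(int)];
};

/* Fails the build if the struct does not span exactly one CacheLine. */
static_assert(sizeof(struct padded_counter) == CACHE_LINE_SIZE,
              "padded_counter must occupy exactly one CacheLine");

/* Distance in bytes between the hot fields of two adjacent elements. */
size_t counter_stride(void)
{
    static struct padded_counter counters[2];

    return (size_t)((char *)&counters[1].writeHighFreq -
                    (char *)&counters[0].writeHighFreq);
}
```

Padding fixes the size of each element but not its start address. Because the two hot fields are a full line apart, they can never share a CacheLine, though each element may still straddle a line boundary unless its storage is also aligned.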
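MySQL aligns many data structures to the 64-byte CacheLine of the x86 platform. Because the L3 CacheLine of the Kunpeng 920 processor is 128 bytes, the alignment in the MySQL source code needs to be changed to 128 bytes. Changing the CacheLine alignment of the following MySQL data structures to 128 bytes improved TPM by 3% to 4%.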
- btr_search_latches
- btr_search_sys
- ReadView::m_view_list
- trx_sys_t::rw_trx_list
- trx_sys_t::mysql_trx_list
- trx_sys_t::rsegs
- srv_conc_t::n_active
- lock_sys_t::mutex
- lock_sys_t::wait_mutex
CacheLine Alignment Implementation Example
Code that does not take CacheLine alignment into account:
/* Build without optimization, for example: gcc -O0 -pthread.
 * An optimizing compiler may collapse the loops below. */
#define _GNU_SOURCE            /* for cpu_set_t and sched_setaffinity */
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>

#define TIME_S 99999999
#define NUM_THREADS 4

struct foo {
    int x;
    int y;
};

static struct foo f;
static struct foo testf;

/* Pin the calling thread to one CPU core and report where it runs. */
static void pin_to_cpu(int cpu, const char *name)
{
    cpu_set_t mask;
    cpu_set_t get;

    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) == -1) {
        printf("warning: could not set CPU affinity, continuing...\n");
    }
    CPU_ZERO(&get);
    if (sched_getaffinity(0, sizeof(get), &get) == -1) {
        printf("warning: could not get thread affinity, continuing...\n");
    }
    if (CPU_ISSET(cpu, &get)) {
        printf("%s is running on CPU %d\n", name, cpu);
    }
}

/* The following four functions run concurrently. */

/* Writes testf.y on CPU 0. */
static void *inc_b1(void *arg)
{
    (void)arg;
    pin_to_cpu(0, "inc_b1");
    for (int i = 0; i < TIME_S; ++i)
        ++testf.y;
    return NULL;
}

/* Reads testf.x on CPU 1; shares a CacheLine with the writer above. */
static void *sum_a1(void *arg)
{
    (void)arg;
    pin_to_cpu(1, "sum_a1");
    int s = 0;
    for (int i = 0; i < TIME_S; ++i)
        s += testf.x;
    return (void *)(intptr_t)s;
}

/* Reads f.x on CPU 2. */
static void *sum_a(void *arg)
{
    (void)arg;
    pin_to_cpu(2, "sum_a");
    int s = 0;
    for (int i = 0; i < TIME_S; ++i)
        s += f.x;
    return (void *)(intptr_t)s;
}

/* Writes f.y on CPU 3; shares a CacheLine with the reader above. */
static void *inc_b(void *arg)
{
    (void)arg;
    pin_to_cpu(3, "inc_b");
    for (int i = 0; i < TIME_S; ++i)
        ++f.y;
    return NULL;
}

int main(void)
{
    void *(*workers[NUM_THREADS])(void *) = { sum_a, inc_b, sum_a1, inc_b1 };
    pthread_t tids[NUM_THREADS];

    printf("start the threads\n");
    for (int i = 0; i < NUM_THREADS; ++i) {
        int ret = pthread_create(&tids[i], NULL, workers[i], NULL);
        if (ret != 0) {
            printf("pthread_create error: error_code = %d\n", ret);
            return 1;
        }
    }
    /* Wait for every thread so that all four loops run to completion. */
    for (int i = 0; i < NUM_THREADS; ++i)
        pthread_join(tids[i], NULL);
    return 0;
}
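The execution time is 2.955 s.
Code modified according to the CacheLine alignment principle:
Adjust the foo structure so that the members x and y are placed in two different CacheLines. This avoids false sharing and shortens the execution time to 2.248 s.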
struct foo {
    int x;
    char padx[124];   /* 128 - sizeof(int): pushes y into the next CacheLine */
    int y;
    char pady[124];
};
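As an alternative to counting padding bytes by hand, C11 alignment specifiers can express the same layout. This is not the method used in the text, only a sketch of an equivalent approach. The names `foo_aligned` and `same_cache_line` are hypothetical, and `CACHE_LINE_SIZE` is assumed to be 128.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 128   /* assumption: Kunpeng 920 CacheLine size */

/* Each member starts on its own CacheLine; the compiler inserts the padding. */
struct foo_aligned {
    alignas(CACHE_LINE_SIZE) int x;
    alignas(CACHE_LINE_SIZE) int y;
};

/* Compile-time checks that the layout matches the hand-padded struct. */
static_assert(offsetof(struct foo_aligned, y) == CACHE_LINE_SIZE,
              "y must start one CacheLine after x");
static_assert(sizeof(struct foo_aligned) == 2 * CACHE_LINE_SIZE,
              "struct must span exactly two CacheLines");

/* Return nonzero if both addresses fall into the same CacheLine. */
int same_cache_line(const void *a, const void *b)
{
    return (uintptr_t)a / CACHE_LINE_SIZE == (uintptr_t)b / CACHE_LINE_SIZE;
}
```

Unlike manual padding, the alignment specifier also forces objects of this type to start on a CacheLine boundary, so x and y each begin a line of their own.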