CacheLine
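The CPU does not read data one byte at a time; it reads in units of a CacheLine. The CPU also tracks whether cached data is valid per CacheLine, not per memory bus width. This mechanism can cause false sharing, which lowers the CPU cache hit ratio. A common cause of false sharing is frequently accessed data that is not aligned to the CacheLine size.
The cache is divided into CacheLines, as shown in Figure 1. When false sharing occurs, readHighFreq is read from memory again even though it has not been modified and is still in the cache.
For example, the following code defines two variables that fall in the same CacheLine, so the cache loads both of them together: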
int readHighFreq, writeHighFreq;
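Here readHighFreq is a frequently read variable and writeHighFreq is a frequently written variable. When writeHighFreq is modified on one CPU core, the copies of that whole CacheLine held by other cores are marked invalid, which means readHighFreq is marked invalid as well. Although readHighFreq itself was not modified, a CPU that accesses it must load it again from memory. This is false sharing, and it reduces performance.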
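The condition described above can be checked with a little address arithmetic. The following is a minimal sketch, not part of the original guide: the helper same_cache_line and the struct hot_pair are hypothetical names, and CACHE_LINE_SIZE is assumed to be 128 bytes.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 128  /* assumption: Kunpeng 920 CacheLine size in bytes */

/* Two variables declared back to back, as in the example above. */
struct hot_pair {
    int readHighFreq;
    int writeHighFreq;
};

/* Hypothetical helper: returns 1 if both addresses fall into the same
 * CACHE_LINE_SIZE-sized block of memory, 0 otherwise. */
static int same_cache_line(const void *a, const void *b)
{
    return ((uintptr_t)a / CACHE_LINE_SIZE) == ((uintptr_t)b / CACHE_LINE_SIZE);
}
```

Calling same_cache_line(&p.readHighFreq, &p.writeHighFreq) on a struct hot_pair p returns 1 unless p happens to straddle a CacheLine boundary, which is the situation where a write to one member invalidates the other.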
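The Kunpeng 920 processor and x86 use different CacheLine sizes, so a program that was tuned on x86 may perform poorly on the Kunpeng 920 processor. In that case the memory alignment size of the data in the service code must be changed. The CacheLine size of the x86 L3 cache is 64 bytes; the CacheLine size of the Kunpeng 920 processor is 128 bytes.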
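Because the CacheLine size differs between platforms, a program may want to query it rather than hard-code it. The sketch below assumes a glibc-style C library that provides the _SC_LEVEL3_CACHE_LINESIZE extension to sysconf; the helper name cache_line_size and the fallback constant are hypothetical. Some systems report 0 or -1 for this query, so the sketch falls back to a compile-time default.

```c
#include <unistd.h>

/* Assumption: fallback value, the Kunpeng 920 size given in the text. */
#define DEFAULT_CACHE_LINE_SIZE 128

/* Hypothetical helper: returns the L3 CacheLine size in bytes as reported
 * by the C library, or DEFAULT_CACHE_LINE_SIZE if it is not reported. */
static long cache_line_size(void)
{
#ifdef _SC_LEVEL3_CACHE_LINESIZE
    long n = sysconf(_SC_LEVEL3_CACHE_LINESIZE);
    if (n > 0)
        return n;
#endif
    return DEFAULT_CACHE_LINE_SIZE;
}
```

A runtime value suits dynamic allocation; struct padding still needs a compile-time constant.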
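CacheLine Alignment Programming Methods
Data that is read and written frequently should be aligned to the CacheLine size. There are two ways to do this: dynamic memory allocation and padding.
Alignment with dynamic memory allocation: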
int posix_memalign(void **memptr, size_t alignment, size_t size);
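On success, posix_memalign returns 0 and stores in *memptr the address of size bytes of dynamically allocated memory whose start address is a multiple of alignment.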
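A minimal usage sketch of posix_memalign follows. The wrapper alloc_line_aligned_ints is a hypothetical name, and CACHE_LINE_SIZE is assumed to be 128 bytes. posix_memalign requires the alignment to be a power of two and a multiple of sizeof(void *), which 128 satisfies.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* expose posix_memalign in strict modes */
#endif
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE_SIZE 128  /* assumption: Kunpeng 920 CacheLine size in bytes */

/* Hypothetical helper: allocates room for `count` ints starting on a
 * CacheLine boundary. Returns NULL on failure; release with free(). */
static int *alloc_line_aligned_ints(size_t count)
{
    void *p = NULL;

    /* posix_memalign returns 0 on success and an error number otherwise. */
    if (posix_memalign(&p, CACHE_LINE_SIZE, count * sizeof(int)) != 0)
        return NULL;
    return p;
}
```

The returned pointer can be verified with (uintptr_t)p % CACHE_LINE_SIZE == 0.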
Local variables can be padded instead:
int writeHighFreq; char pad[CACHE_LINE_SIZE - sizeof(int)];
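In this code, CACHE_LINE_SIZE is the CacheLine size of the server. The pad variable is never used; it only fills the space that remains after writeHighFreq, so that the two together occupy exactly one CacheLine.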
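The padding approach is often wrapped in a struct so that the size can be checked at compile time. The sketch below is illustrative only: struct padded_counter is a hypothetical name, and CACHE_LINE_SIZE is assumed to be 128 bytes.

```c
#include <assert.h>  /* static_assert (C11) */
#include <stddef.h>

#define CACHE_LINE_SIZE 128  /* assumption: Kunpeng 920 CacheLine size in bytes */

/* One frequently written value plus padding up to a full CacheLine. */
struct padded_counter {
    int writeHighFreq;
    char pad[CACHE_LINE_SIZE - sizeof(int)];  /* unused filler */
};

/* Fails the build if the struct does not occupy exactly one CacheLine. */
static_assert(sizeof(struct padded_counter) == CACHE_LINE_SIZE,
              "padded_counter must occupy exactly one CacheLine");
```

In an array of struct padded_counter, the writeHighFreq members of neighbouring elements are CACHE_LINE_SIZE bytes apart, so they never share a CacheLine.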
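MySQL contains many 64-byte CacheLine alignments that were made for the x86 platform. Because the L3 CacheLine of the Kunpeng 920 processor is 128 bytes, the alignment in the MySQL source code needs to be changed to 128 bytes. Changing the CacheLine alignment of the following MySQL data structures to 128 bytes improves TPM by 3% to 4%.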
- btr_search_latches btr_search_sys
- ReadView::m_view_list
- trx_sys_t::rw_trx_list
- trx_sys_t::mysql_trx_list
- trx_sys_t::rsegs
- srv_conc_t::n_active
- lock_sys_t::mutex
- lock_sys_t::wait_mutex
CacheLine Alignment Implementation Example
Code without CacheLine alignment:
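#define _GNU_SOURCE   /* for cpu_set_t, the CPU_* macros and sched_setaffinity */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define TIME_S 99999999
#define NUM_THREADS 4

struct foo {
    int x;
    int y;
};

static struct foo f;
static struct foo testf;

/* The following four functions run concurrently, each pinned to its own CPU: */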
void *inc_b1(void *arg)
{
    cpu_set_t mask;
    cpu_set_t get;

    (void)arg;
    /* Pin this thread to CPU 0. */
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) == -1) {
        printf("warning: could not set CPU affinity, continuing...\n");
    }
    CPU_ZERO(&get);
    if (sched_getaffinity(0, sizeof(get), &get) == -1) {
        printf("warning: could not get thread affinity, continuing...\n");
    }
    if (CPU_ISSET(0, &get)) {
        printf("inc_b1 is running on CPU %d\n", 0);
    }
    /* Write-heavy loop: repeatedly modifies testf.y. */
    for (int i = 0; i < TIME_S; ++i)
        ++testf.y;
    return NULL;
}
void *sum_a1(void *arg)
{
    cpu_set_t mask;
    cpu_set_t get;

    (void)arg;
    /* Pin this thread to CPU 1. */
    CPU_ZERO(&mask);
    CPU_SET(1, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) == -1) {
        printf("warning: could not set CPU affinity, continuing...\n");
    }
    CPU_ZERO(&get);
    if (sched_getaffinity(0, sizeof(get), &get) == -1) {
        printf("warning: could not get thread affinity, continuing...\n");
    }
    if (CPU_ISSET(1, &get)) {
        printf("sum_a1 is running on CPU %d\n", 1);
    }
    /* Read-heavy loop: repeatedly reads testf.x. */
    int s = 0;
    for (int i = 0; i < TIME_S; ++i)
        s += testf.x;
    return (void *)(long)s;  /* hand the sum back so the result is used */
}
void *sum_a(void *arg)
{
    cpu_set_t mask;
    cpu_set_t get;

    (void)arg;
    /* Pin this thread to CPU 2. */
    CPU_ZERO(&mask);
    CPU_SET(2, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) == -1) {
        printf("warning: could not set CPU affinity, continuing...\n");
    }
    CPU_ZERO(&get);
    if (sched_getaffinity(0, sizeof(get), &get) == -1) {
        printf("warning: could not get thread affinity, continuing...\n");
    }
    if (CPU_ISSET(2, &get)) {
        printf("sum_a is running on CPU %d\n", 2);
    }
    /* Read-heavy loop: repeatedly reads f.x. */
    int s = 0;
    for (int i = 0; i < TIME_S; ++i)
        s += f.x;
    return (void *)(long)s;  /* hand the sum back so the result is used */
}
void *inc_b(void *arg)
{
    cpu_set_t mask;
    cpu_set_t get;

    (void)arg;
    /* Pin this thread to CPU 3. */
    CPU_ZERO(&mask);
    CPU_SET(3, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) == -1) {
        printf("warning: could not set CPU affinity, continuing...\n");
    }
    CPU_ZERO(&get);
    if (sched_getaffinity(0, sizeof(get), &get) == -1) {
        printf("warning: could not get thread affinity, continuing...\n");
    }
    if (CPU_ISSET(3, &get)) {
        printf("inc_b is running on CPU %d\n", 3);
    }
    /* Write-heavy loop: repeatedly modifies f.y. */
    for (int i = 0; i < TIME_S; ++i)
        ++f.y;
    return NULL;
}
int main(void)
{
    int ret = 0;
    pthread_t tids[NUM_THREADS];

    printf("start the threads\n");
    ret = pthread_create(&tids[0], NULL, sum_a, NULL);
    if (ret != 0) {
        printf("pthread_create error: error_code = %d\n", ret);
    }
    ret = pthread_create(&tids[1], NULL, inc_b, NULL);
    if (ret != 0) {
        printf("pthread_create error: error_code = %d\n", ret);
    }
    ret = pthread_create(&tids[2], NULL, sum_a1, NULL);
    if (ret != 0) {
        printf("pthread_create error: error_code = %d\n", ret);
    }
    ret = pthread_create(&tids[3], NULL, inc_b1, NULL);
    if (ret != 0) {
        printf("pthread_create error: error_code = %d\n", ret);
    }
    /* Wait for all four threads, not only the first one. */
    for (int i = 0; i < NUM_THREADS; ++i)
        pthread_join(tids[i], NULL);
    return 0;
}
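The execution time is 2.955s.
Modified according to the CacheLine alignment principle:
Adjust the foo structure so that the x and y members are placed in two different CacheLines, which avoids false sharing. The execution time drops to 2.248s.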
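struct foo {
    int x;
    char padx[124];  /* fills the rest of x's 128-byte CacheLine (4-byte int) */
    int y;
    char pady[124];  /* fills the rest of y's 128-byte CacheLine */
};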
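As an alternative to hand-counted padding, C11 alignment specifiers can express the same layout. This is a sketch of a different technique, not the method measured above: struct foo_aligned is a hypothetical name, CACHE_LINE_SIZE is assumed to be 128 bytes, and a C11 compiler is assumed.

```c
#include <assert.h>    /* static_assert (C11) */
#include <stdalign.h>  /* alignas (C11) */
#include <stddef.h>    /* offsetof */

#define CACHE_LINE_SIZE 128  /* assumption: Kunpeng 920 CacheLine size in bytes */

/* Each member starts on its own CacheLine boundary; the compiler inserts
 * the padding, so the layout does not depend on sizeof(int). */
struct foo_aligned {
    alignas(CACHE_LINE_SIZE) int x;
    alignas(CACHE_LINE_SIZE) int y;
};

/* x and y are exactly one CacheLine apart. */
static_assert(offsetof(struct foo_aligned, y) == CACHE_LINE_SIZE,
              "x and y must sit in different CacheLines");
```

Unlike manual padding, this also raises the alignment of the whole struct to CACHE_LINE_SIZE, so a static instance begins on a CacheLine boundary.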
