
GPU Memory Throughput Maximization

GPU memory usage has a major impact on GPU performance. Therefore, memory accesses should be minimized as much as possible. Figure 1 shows the GPU memory model.

Figure 1 GPU memory model

Because the performance of the various memory types differs markedly, the key is to coalesce global memory accesses whenever possible. You can optimize as follows:

  • Maximize the use of page-locked (pinned) memory.

    Pinned memory is page-locked host (CPU) memory, allocated with cudaHostAlloc() or registered from an existing allocation with cudaHostRegister(). Because its pages cannot be swapped out, the GPU can copy data from it directly via DMA, which is faster than transferring from pageable memory.
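    As a sketch of typical usage (hypothetical buffer size, error handling omitted; requires the CUDA toolkit, so this is illustrative rather than runnable here):

```cpp
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 20;            // hypothetical 1 MiB buffer
    float *h_buf = nullptr, *d_buf = nullptr;

    // Allocate page-locked (pinned) host memory instead of malloc().
    cudaHostAlloc(&h_buf, bytes, cudaHostAllocDefault);
    cudaMalloc(&d_buf, bytes);

    // Transfers from pinned memory use DMA; with cudaMemcpyAsync on a
    // stream they can also overlap with kernel execution.
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);

    cudaFree(d_buf);
    cudaFreeHost(h_buf);    // pinned memory is freed with cudaFreeHost
    return 0;
}
```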

  • Minimize memory transfers between the host and device.
    • Combine multiple transfers into one.
    • Run even low-parallelism tasks on the GPU when doing so avoids host-device memory transfers.
  • Consider alignment for global memory access.

    If accesses are misaligned, the compiler splits read and write operations into multiple transactions, reducing memory access performance.

    For one-dimensional data, use cudaMalloc() to allocate GPU global memory. For multi-dimensional data, you are advised to use cudaMallocPitch(), which guarantees segment alignment: the start address of the first element in each row of the allocated array is aligned. Because the number of elements per row is arbitrary, cudaMallocPitch() pads each row with extra bytes so that the row width in bytes (the number of elements in the x direction multiplied by the element size, plus the padding) is a multiple of 256.

  • Avoid storing kernel variables in local memory, which resides in slow off-chip device memory.
  • Minimize memory operations per thread. Note that this is not always effective: in some cases, data is staged into shared memory and later copied back, so the total number of memory operations increases while the overall memory access time decreases.

    For example, consider computing the sum of squares of an array. When a thread stalls waiting for memory data, the GPU switches to the next thread, so the actual order of memory reads is roughly thread 0 → thread 1 → ... → thread n. In the code below, each thread reads a contiguous block of the array on its own, but at any given moment the threads of a warp access addresses that are far apart, so the overall access pattern during execution is not sequential.

    __global__ static void squaresSum(int *data, int *sum, clock_t *time)
    {
        const int size = DATA_SIZE / THREAD_NUM;
        const int tid = threadIdx.x;
        int tmp_sum = 0;
        clock_t start;
        if (tid == 0) start = clock();
        for (int i = tid * size; i < (tid + 1) * size; i++) 
        {
            tmp_sum += data[i] * data[i];
        }
        sum[tid] = tmp_sum;
        if (tid == 0) *time = clock() - start;
    }

    If instead the array is read with a stride of THREAD_NUM, consecutive threads access consecutive addresses at the same moment, so the hardware can coalesce each warp's reads into a few memory transactions, greatly reducing the overhead. The following code performs about 15 times faster than the preceding code.

    __global__ static void squaresSum(int *data, int *sum, clock_t *time)
    {
        const int size = DATA_SIZE / THREAD_NUM;
        const int tid = threadIdx.x;
        int tmp_sum = 0;
        clock_t start;
        if (tid == 0) start = clock();
        for (int i = tid; i < DATA_SIZE; i += THREAD_NUM)
        {
            tmp_sum += data[i] * data[i];
        }
        sum[tid] = tmp_sum;
        if (tid == 0) *time = clock() - start;
    }
  • Avoid thread divergence. A GPU runs most efficiently when adjacent threads execute the same operation.

    For example, in a reduce operation to sum all elements of an array of length N, the algorithm typically involves three steps:

    1. Load data to the shared memory.
    2. Perform the reduce operation on the data in the shared memory.
    3. Write the final result back to the global memory.

    __global__ void reduce(float *d_in,float *d_out){
        __shared__ float sdata[THREAD_PER_BLOCK];
    
        //each thread loads one element from global memory to shared mem
        unsigned int i=blockIdx.x*blockDim.x+threadIdx.x;
        unsigned int tid=threadIdx.x;
        sdata[tid]=d_in[i];
        __syncthreads();
    
        // do reduction in shared mem
        for(unsigned int s=1; s<blockDim.x; s*=2){
            if(tid%(2*s) == 0){
                sdata[tid]+=sdata[tid+s];
            }
            __syncthreads();
        }
    
        // write result for this block to global mem
        if(tid==0)d_out[blockIdx.x]=sdata[tid];
    }

    Assume that each thread block contains 256 threads (THREAD_PER_BLOCK = 256). The 256 threads are divided into eight warps, each containing 32 threads. The thread tid reads data i from the global memory and stores it in the shared memory. Then, the iteration starts.

    1. First iteration: If tid%2==0, the thread tid adds the value at the tid position to the value at the tid + 1 position in the shared memory, and then stores the result at the tid position.
    2. Second iteration: If tid%4==0, the thread tid adds the value at the tid position to the value at the tid + 2 position in the shared memory, and then stores the result at the tid position.
    3. The iterations continue until all elements are accumulated at position 0.

    The biggest problem with the preceding code is warp divergence. All threads in a warp execute the same instruction; when a branch such as if-else splits the warp, both paths are executed serially, and the results of threads whose condition is not met are masked off. As shown in the preceding figure, two branches (in purple and green) arise in each iteration, which severely hurts execution efficiency. If all threads of a warp take the same branch, efficiency improves.

    After optimization:

    __global__ void reduce1(float *d_in,float *d_out){
        __shared__ float sdata[THREAD_PER_BLOCK];
    
        //each thread loads one element from global memory to shared mem
        unsigned int i=blockIdx.x*blockDim.x+threadIdx.x;
        unsigned int tid=threadIdx.x;
        sdata[tid]=d_in[i];
        __syncthreads();
    
        // do reduction in shared mem
        for(unsigned int s=1; s<blockDim.x; s*=2){
            int index = 2*s*tid;
            if(index < blockDim.x){
                sdata[index]+=sdata[index+s];
            }
            __syncthreads();
        }
    
        // write result for this block to global mem
        if(tid==0)d_out[blockIdx.x]=sdata[tid];
    }

    Although the if statement still exists, the branch behavior differs from the previous code. Again, assume that there are 256 threads in a block and therefore eight warps (256/32).

    During the first iteration, the index values of warps 0 to 3 are smaller than blockDim.x, and those of warps 4 to 7 are greater than or equal to blockDim.x. Each warp takes only one branch, so no warp divergence occurs.

    During the second iteration, warps 0 and 1 enter the computation branch.

    During the third iteration, only warp 0 enters the computation branch.

    During the fourth iteration, only the first 16 threads of warp 0 enter the branch, so warp divergence occurs at this point. This scheme therefore eliminates warp divergence in the first three iterations.

  • Prevent bank conflicts when using the shared memory.

    The biggest problem of reduce1 is bank conflicts. Let's focus only on warp 0 in the for loop.

    In the first iteration, thread 0 loads the data at addresses 0 and 1 in the shared memory and writes the result back to address 0. Meanwhile, thread 16 in the same warp loads the data at addresses 32 and 33. Because addresses 0 and 32 fall in the same bank, a two-way bank conflict occurs.

    In the second iteration, thread 0 loads the data at addresses 0 and 2 in the shared memory. Thread 8 in the same warp loads addresses 32 and 34, thread 16 loads addresses 64 and 66, and thread 24 loads addresses 96 and 98. Because addresses 0, 32, 64, and 96 map to the same bank, a four-way bank conflict occurs.

    As the iterations continue, 8-way and 16-way bank conflicts occur. These conflicts limit the performance of reduce1. The following figure shows the bank conflicts when the first operand is loaded.

    To eliminate the bank conflicts, reverse the for loop: the stride s changes from ascending (1 up to blockDim.x) to descending (blockDim.x/2 down to 1).

    __global__ void reduce2(float *d_in,float *d_out){
        __shared__ float sdata[THREAD_PER_BLOCK];
    
        //each thread loads one element from global memory to shared mem
        unsigned int i=blockIdx.x*blockDim.x+threadIdx.x;
        unsigned int tid=threadIdx.x;
        sdata[tid]=d_in[i];
        __syncthreads();
    
        // do reduction in shared mem
        for(unsigned int s=blockDim.x/2; s>0; s>>=1){
            if(tid < s){
                sdata[tid]+=sdata[tid+s];
            }
            __syncthreads();
        }
    
        // write result for this block to global mem
        if(tid==0)d_out[blockIdx.x]=sdata[tid];
    }

    Let's see how this change eliminates bank conflicts.

    Again, analyze only warp 0 in the for loop. In the first iteration, thread 0 loads elements 0 and 128 from the shared memory, and thread 1 loads elements 1 and 129. For each of the two loads, the 32 threads of the warp access 32 consecutive addresses, one per bank, so no conflict occurs.

    In the second iteration, thread 0 loads elements 0 and 64, and thread 1 loads elements 1 and 65. Each load again touches 32 distinct banks.

    In the third iteration, thread 0 loads elements 0 and 32, and so on. Each load still falls in 32 distinct banks, and no bank conflict occurs.

    In the fourth iteration, thread 0 loads elements 0 and 16.

    The following figure shows the iteration process.

    For more information about reduce function optimization, see https://github.com/Liu-xiandong/How_to_optimize_in_GPU.

  • Access global memory contiguously as much as possible.

    For example, N × N threads are started to calculate the product of two N × N matrices. Each element of the result matrix is calculated using the following formula:

    Cij = Σk (Aik × Bkj), k = 0, 1, ..., N−1

    During the calculation of Cij, the accesses to Bkj are non-contiguous (stride N), which reduces memory access efficiency. In this case, the transposed matrix BT can be introduced to make the accesses contiguous.

    Before optimization, each thread calculates only one element with non-contiguous access to array B.

    __global__ void matrix_mul_gpu(int n, float *a, float *b, float *c)
    {
        const int bid = blockIdx.x;
        const int tid = threadIdx.x;
        float s = 0.0;
        for (int k = 0; k < n; ++k)
            s += a[bid*n + k] * b[k*n + tid];    // A(i,k) * B(k,j)
        c[bid*n + tid] = s;
    }

    After optimization, both A and the transposed matrix BT are accessed contiguously, yielding about a 30% performance improvement.

    __global__ void transpose(int n, float *b, float *bt)
    {
        bt[blockIdx.x*n + threadIdx.x] = b[threadIdx.x*n + blockIdx.x];
    }
    __global__ void matrix_mul_gpu_opt(int n, float *a, float *b, float *c)
    {
        const int bid = blockIdx.x;
        const int tid = threadIdx.x;
        float s = 0.0;
        for (int k = 0; k < n; ++k)
            s += a[bid*n + k] * b[tid*n + k];    // A × BT (contiguous access)
        c[bid*n + tid] = s;
    }