GPU Usage Maximization
- Application level
Use asynchronous functions and streams to maximize the parallelism of host tasks, device tasks, and host-device communication tasks.
```cuda
cudaStream_t stream[2];
for (int i = 0; i < 2; ++i)
    cudaStreamCreate(&stream[i]);

float *hostPtr;
cudaMallocHost(&hostPtr, 2 * size);  // pinned host memory, required for truly asynchronous copies

for (int i = 0; i < 2; ++i) {
    cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size, size,
                    cudaMemcpyHostToDevice, stream[i]);
    MyKernel<<<100, 512, 0, stream[i]>>>(outputDevPtr + i * size,
                                         inputDevPtr + i * size, size);
    cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size, size,
                    cudaMemcpyDeviceToHost, stream[i]);
}

for (int i = 0; i < 2; ++i)
    cudaStreamDestroy(stream[i]);
```
Unlike cudaMemcpy, which blocks the CPU until the copy completes, cudaMemcpyAsync returns immediately, so the CPU can continue working while the transfer is in flight. The example uses CUDA streams: copy and kernel operations are submitted to different streams so that they can execute concurrently. For instance, one stream asynchronously transfers host data to GPU memory while the other runs a kernel on the GPU; once the computation finishes, its result is copied back to host memory asynchronously. This multi-stream scheduling overlaps data transfer with computation, significantly improving GPU utilization and overall program performance, and even small kernels can keep the GPU busy under this pattern. If no stream is specified, work is issued to the default stream (stream 0) and serializes; specifying streams is what enables the concurrency. In the following examples, assuming each individual operation completes in 1 s, the listed variants have the same total running time.
Most systems currently connect the CPU and GPU over PCIe; only a few use NVLink. Host-device communication is therefore expensive, but asynchronous data transfer can hide much of its cost.
The following code example creates two streams and compares four variants: a synchronous baseline, a single-stream asynchronous version, and two two-stream versions in which host-to-device and device-to-host copies proceed in parallel.
```cuda
cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

// Synchronous baseline. Execution time: 3 s
cudaMemcpy(d_arr, h_arr, numbytes, cudaMemcpyHostToDevice);
A<<<1, 128>>>(d_arr);
cudaMemcpy(h_arr, d_arr, numbytes, cudaMemcpyDeviceToHost);

// Single stream: operations in one stream serialize. Execution time: 3 s
cudaMemcpyAsync(d_arr, h_arr, numbytes, cudaMemcpyHostToDevice, s1);
A<<<1, 128, 0, s1>>>(d_arr);
cudaMemcpyAsync(h_arr, d_arr, numbytes, cudaMemcpyDeviceToHost, s1);

// Two streams, work issued depth-first. Execution time: 3 s
cudaMemcpyAsync(d_arr1, h_arr1, numbytes, cudaMemcpyHostToDevice, s1);
A<<<1, 128, 0, s1>>>(d_arr1);
cudaMemcpyAsync(h_arr1, d_arr1, numbytes, cudaMemcpyDeviceToHost, s1);
cudaMemcpyAsync(d_arr2, h_arr2, numbytes, cudaMemcpyHostToDevice, s2);
B<<<1, 192, 0, s2>>>(d_arr2);
cudaMemcpyAsync(h_arr2, d_arr2, numbytes, cudaMemcpyDeviceToHost, s2);

// Two streams, work issued breadth-first. Execution time: 3 s
cudaMemcpyAsync(d_arr1, h_arr1, numbytes, cudaMemcpyHostToDevice, s1);
cudaMemcpyAsync(d_arr2, h_arr2, numbytes, cudaMemcpyHostToDevice, s2);
A<<<1, 128, 0, s1>>>(d_arr1);
B<<<1, 192, 0, s2>>>(d_arr2);
cudaMemcpyAsync(h_arr1, d_arr1, numbytes, cudaMemcpyDeviceToHost, s1);
cudaMemcpyAsync(h_arr2, d_arr2, numbytes, cudaMemcpyDeviceToHost, s2);
```
- Device level
When kernels have no data dependencies on one another, launch them in separate streams so they can execute concurrently on the device.
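A minimal sketch of this idea (the kernel names kernA/kernB and buffers d_a/d_b are illustrative, not from the original text):

```cuda
__global__ void kernA(float *x) { x[threadIdx.x] *= 2.0f; }
__global__ void kernB(float *x) { x[threadIdx.x] += 1.0f; }

void launchIndependent(float *d_a, float *d_b) {
    cudaStream_t sa, sb;
    cudaStreamCreate(&sa);
    cudaStreamCreate(&sb);

    // No data dependency between the two launches, so placing them in
    // separate streams lets the hardware scheduler run them concurrently
    // whenever enough SM resources are free.
    kernA<<<1, 128, 0, sa>>>(d_a);
    kernB<<<1, 128, 0, sb>>>(d_b);

    cudaStreamSynchronize(sa);
    cudaStreamSynchronize(sb);
    cudaStreamDestroy(sa);
    cudaStreamDestroy(sb);
}
```

Had both kernels been launched into the same (or the default) stream, they would have executed one after the other even though neither depends on the other's output.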
- Multi-threading
When writing kernel code, use registers and shared memory carefully so that occupancy is not reduced.
The occupancy of each SM on an A100 (compute capability 8.0) is subject to the following limits:
- Maximum number of resident thread blocks per SM: 32
- Maximum number of threads per thread block: 1,024
- Maximum number of registers shared by all resident thread blocks: 65,536
- Maximum shared memory per SM: 164 KB
- Block size: should be a multiple of the warp size (32 threads)
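Per-SM limits differ across GPU generations, so rather than hard-coding the numbers above, a sketch like the following can query them at run time (the function name is illustrative):

```cuda
#include <cstdio>

void printSmLimits(int dev) {
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, dev);
    printf("max threads per block : %d\n", p.maxThreadsPerBlock);
    printf("max threads per SM    : %d\n", p.maxThreadsPerMultiProcessor);
    printf("registers per SM      : %d\n", p.regsPerMultiprocessor);
    printf("shared memory per SM  : %zu bytes\n", p.sharedMemPerMultiprocessor);
    printf("warp size             : %d\n", p.warpSize);
}
```

Comparing these values against a kernel's per-block register and shared-memory usage shows which resource caps its occupancy.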
For GPUs, assigning more work to each thread helps hide latency: more independent work per thread gives the compiler greater freedom to reorder instructions, which covers the long latency of memory accesses. Combined with a block size that is a multiple of the warp size, this maximizes GPU performance.
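Rather than guessing, the CUDA occupancy API can suggest a block size (always a warp-size multiple) for a specific kernel; a sketch, with MyKernel as a stand-in for any real kernel:

```cuda
__global__ void MyKernel(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

void launchWithSuggestedBlockSize(float *d_out, const float *d_in, int n) {
    int minGridSize = 0, blockSize = 0;
    // Returns a block size that achieves the best theoretical occupancy
    // for MyKernel on the current device (0 dynamic shared memory,
    // no block-size limit).
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, MyKernel, 0, 0);
    int gridSize = (n + blockSize - 1) / blockSize;
    MyKernel<<<gridSize, blockSize>>>(d_out, d_in, n);
}
```

This keeps launch configuration tied to the kernel's actual register and shared-memory footprint instead of a fixed constant.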