Instruction Throughput Maximization

Arithmetic intensity is the ratio of arithmetic operations to memory traffic, typically measured in floating-point operations per byte of data transferred. To maximize arithmetic intensity, you can optimize as follows:

  • Minimize the use of low-throughput arithmetic instructions.

    For details about the relationship between arithmetic instructions and throughput, see the CUDA C++ Programming Guide.

    Table 1 lists typical arithmetic instructions and their throughput on A100 (compute capability 8.0), in results per clock cycle per multiprocessor.

    Table 1 Relationship between arithmetic instructions and throughput of A100

    Arithmetic Instruction                                                                        Throughput
    16-bit floating-point addition, multiplication, and FMA                                       256
    32-bit floating-point addition, multiplication, and FMA                                       64
    64-bit floating-point addition, multiplication, and FMA                                       32
    32-bit floating-point reciprocal, reciprocal square root, __log2f, exp2f, __sinf, and __cosf  16
    Maximum and minimum                                                                           64
    32-bit integer bitwise NOT                                                                    16
    32-bit integer bitwise AND, OR, and XOR                                                       64

  • Use low-precision types and intrinsic functions when they do not affect the final result.

    For example, shuffle instructions are a group of warp-level instructions whose key feature is that threads within a warp can read each other's registers directly. Without shuffle instructions, threads can exchange data only through shared memory; with them, threads in a warp access each other's registers directly, reducing memory access latency. Many operators that aggregate values across threads, such as softmax, batch_norm, and reduce, are implemented using shuffle instructions. The following code performs a reduce operation across the 32 threads of a warp.

    __device__ __forceinline__ float warpReduceSum(float sum) {
        // blockSize is assumed to be a compile-time constant (for example,
        // a template parameter) equal to the number of threads per block.
        // Each step halves the number of active lanes; after all steps,
        // lane 0 holds the sum of the whole warp.
        if (blockSize >= 32) sum += __shfl_down_sync(0xffffffff, sum, 16);
        if (blockSize >= 16) sum += __shfl_down_sync(0xffffffff, sum, 8);
        if (blockSize >= 8)  sum += __shfl_down_sync(0xffffffff, sum, 4);
        if (blockSize >= 4)  sum += __shfl_down_sync(0xffffffff, sum, 2);
        if (blockSize >= 2)  sum += __shfl_down_sync(0xffffffff, sum, 1);
        return sum;
    }
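A warp-level reduction like this is typically composed into a full block reduction by staging per-warp partial sums in shared memory. The following self-contained sketch shows the common pattern (the kernel name and shapes are illustrative; it assumes blockDim.x is a multiple of warpSize and at most 1024):

```cuda
// Full 32-lane warp reduction (the blockSize >= 32 case above).
__device__ __forceinline__ float warpSum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v;
}

// Each warp reduces its own values, warp leaders stage the partial sums
// in shared memory, and the first warp reduces those partials.
__global__ void blockSumKernel(const float* in, float* out) {
    __shared__ float partial[32];                 // one slot per warp
    float v = in[blockIdx.x * blockDim.x + threadIdx.x];
    unsigned lane = threadIdx.x % warpSize;
    unsigned wid  = threadIdx.x / warpSize;

    v = warpSum(v);                               // reduce within each warp
    if (lane == 0) partial[wid] = v;              // warp leader writes its sum
    __syncthreads();

    if (wid == 0) {                               // first warp reduces partials
        v = (lane < blockDim.x / warpSize) ? partial[lane] : 0.0f;
        v = warpSum(v);
        if (lane == 0) out[blockIdx.x] = v;       // one result per block
    }
}
```

Only the staging step touches shared memory; all other communication stays in registers, which is what makes shuffle-based reductions fast.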

    In addition, CUDA provides fast intrinsic versions of many common mathematical functions, such as __sinf, __cosf, and __expf. These intrinsics trade a small loss in precision, typically 2 to 3 bits relative to the standard C math library, for significantly faster performance.

    For the full list of intrinsic functions, see the CUDA C++ Programming Guide.
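As an illustration, an elementwise exponential can call the __expf intrinsic instead of expf when the small precision loss is acceptable (a sketch; the kernel name and launch shape are assumptions):

```cuda
// Illustrative elementwise kernel: __expf maps to the fast special
// function unit, trading a few bits of precision for higher throughput
// than the IEEE-accurate expf.
__global__ void fastExpKernel(const float* __restrict__ in,
                              float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = __expf(in[i]);  // accurate alternative: expf(in[i])
    }
}
```

Alternatively, compiling with nvcc --use_fast_math replaces calls such as expf with their intrinsic counterparts module-wide, which applies the same trade-off without source changes.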

  • Trade precision for speed when it does not affect the final result, for example by using single-precision float instead of double-precision double.
  • Minimize the use of flow control instructions, that is, reduce conditional branches, and keep branches warp-uniform so threads in a warp do not diverge. For example:
    1. Make control conditions depend only on threadIdx / warpSize, so all threads in a warp take the same path.
    2. Use #pragma unroll to unroll loops with compile-time trip counts.
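Both points can be sketched as follows (an illustrative example; the kernel and function names are assumptions):

```cuda
// The branch condition depends only on threadIdx.x / warpSize, so every
// thread in a warp takes the same path and no divergence occurs.
__global__ void warpUniformBranch(float* data) {
    int warpId = threadIdx.x / warpSize;   // same value for all 32 lanes of a warp
    if (warpId % 2 == 0) {
        data[threadIdx.x] *= 2.0f;         // whole even-numbered warps take this path
    } else {
        data[threadIdx.x] += 1.0f;         // whole odd-numbered warps take this path
    }
}

// #pragma unroll removes loop-control branches when the trip count is
// known at compile time.
__device__ float dot4(const float* a, const float* b) {
    float acc = 0.0f;
    #pragma unroll
    for (int k = 0; k < 4; ++k) {
        acc += a[k] * b[k];
    }
    return acc;
}
```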
  • Use __fdividef(x, y) for single-precision floating-point division where possible; it is faster than the / operator, at a cost in accuracy when the divisor is very large.
  • In most cases, rsqrtf() is more efficient than 1.0f / sqrtf().
  • When using trigonometric functions, use single-precision arguments and keep the magnitude of x small, since large arguments require a slower argument-reduction path.
  • Use bitwise operations to replace integer division and modulo when the divisor is a power of two.
  • Use half2 instead of half for half-precision operations, so that each instruction processes two values.
  • When calling an API, ensure that the argument types match the parameter types. Otherwise, implicit type conversion consumes extra instructions. For example, if a function parameter is of type int and the argument is of type char or short, the argument must first be converted to int.
  • Optimize register usage, for example by packing multiple small values into one register and manipulating them with fine-grained bitwise operations on the GPU.