Theoretical Computation Limit
To determine the potential benefits of an optimization, this section describes two laws for theoretical evaluation.
- Amdahl's Law — strong scalability
Strong scalability refers to reducing code execution time by increasing the number of streaming multiprocessors (SMs) while keeping the problem scale fixed.
Amdahl's Law explains how improving the performance of a single part of a system affects the overall system performance. It is expressed as S = 1 / ((1 − P) + P/N), where:
- S: indicates the speedup.
- P: indicates the proportion of code that can run in parallel.
- N: indicates the number of SMs for parallel execution.
According to the formula, a larger value of P leads to a larger value of S, meaning a greater potential for optimization. Conversely, a small value of P results in limited optimization gains regardless of the amount of computing resources applied.
- Gustafson's Law — weak scalability
Weak scalability refers to scenarios where the problem scale may increase with the addition of resources.
Gustafson's Law, expressed as S = (1 − P) + P × N, can be used for theoretical analysis with the same parameters as Amdahl's Law.
According to the formula, when the problem scale grows with the added resources, the speedup S increases almost linearly with N. Before an optimization, first analyze your problem scenario, and then evaluate it with the law that matches: Amdahl's Law for a fixed problem scale, Gustafson's Law for a growing one.
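The two laws can be compared numerically with a short sketch. The formulas are the standard forms of Amdahl's and Gustafson's Laws; the sample values P = 0.95 and N = 108 are illustrative (108 matches the A100 SM count used later in this section), not taken from a measured workload:

```python
def amdahl_speedup(p, n):
    """Amdahl's Law: fixed problem scale, n parallel units."""
    return 1.0 / ((1.0 - p) + p / n)

def gustafson_speedup(p, n):
    """Gustafson's Law: problem scale grows with n."""
    return (1.0 - p) + p * n

# Illustrative values: 95% parallel code on 108 SMs
p, n = 0.95, 108
print(f"Amdahl:    {amdahl_speedup(p, n):.1f}x")  # capped below 1/(1-P) = 20x
print(f"Gustafson: {gustafson_speedup(p, n):.1f}x")
```

Even with unlimited SMs, Amdahl's speedup never exceeds 1/(1 − P), while Gustafson's speedup keeps growing with N as long as the problem scale grows with it.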
During optimization, each metric is bounded by the hardware. The following describes the theoretical peak of each metric.
- Single-precision floating-point operations per second (FLOPS) for A100 (twice that of FP64 and one-fourth that of FP16):
1410 MHz (kernel clock) × 1 (GPU) × 108 (SMs) × 64 (floating-point compute units/SM) × 2 (operations/cycle) = 19.5 TFLOPS
- Memory bandwidth for A100:
1 (GPU) × 5120 bits × 1215 MHz (memory clock) × 2 (DDR)/8 bits = 1555 GB/s
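The two peak calculations above can be reproduced directly. The clock rates, SM count, unit count, and bus width are the A100 figures quoted in the text; treat them as nominal datasheet values:

```python
# Theoretical FP32 peak: kernel clock x SMs x FP32 units/SM x 2 (FMA = 2 ops/cycle)
clock_hz   = 1410e6   # kernel clock, 1410 MHz
sms        = 108      # SMs per GPU
fp32_units = 64       # FP32 compute units per SM
flops = clock_hz * sms * fp32_units * 2
print(f"FP32 peak: {flops / 1e12:.1f} TFLOPS")   # -> 19.5 TFLOPS

# Theoretical memory bandwidth: bus width x memory clock x 2 (DDR), bits -> bytes
bus_bits  = 5120      # HBM2 bus width in bits
mem_clock = 1215e6    # memory clock, 1215 MHz
bandwidth = bus_bits * mem_clock * 2 / 8
print(f"Bandwidth: {bandwidth / 1e9:.0f} GB/s")  # -> 1555 GB/s
```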
Generally, a memory bandwidth utilization of 40%–60% is considered average, 60%–75% relatively high, and above 75% extremely high. The performance of most GPU code is limited by memory bandwidth. It is also worth mentioning Little's Law, expressed as L = λW, where:
- L: indicates the average number of items in the system.
- W: indicates the average time an item spends in the system.
- λ: indicates the arrival rate of items entering the system.
Intuitively, if items arrive at a rate of λ per second and each stays for W seconds, then the items present at any moment are exactly those that arrived during the last W seconds, so the system contains L = λ × W items on average.
Likewise, in GPU memory access, the amount of data in flight (L) equals the achieved bandwidth (λ) multiplied by the average latency of each memory transaction (W). To push the achieved bandwidth toward the peak, you should either keep more data in flight (more concurrent memory transactions) or reduce the average latency of each transaction.
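Little's Law can be turned into a quick estimate of how much data must be in flight to saturate the bus. The numbers below reuse the A100 peak bandwidth and kernel clock from this section; the 500-cycle latency is an assumed value picked from the 400–600-cycle global-memory range quoted later, not a measurement:

```python
# Little's Law: L = lambda x W
#   lambda = target bandwidth (bytes/s)
#   W      = average memory latency (s)
#   L      = bytes that must be in flight to sustain that bandwidth
bandwidth_bytes = 1555e9   # A100 theoretical peak from this section
clock_hz        = 1410e6   # kernel clock, 1410 MHz
latency_cycles  = 500      # assumed global-memory latency (400-600 cycle range)

latency_s = latency_cycles / clock_hz
in_flight = bandwidth_bytes * latency_s
print(f"Bytes in flight needed: {in_flight / 1024:.0f} KiB")
```

The result, on the order of hundreds of kibibytes of outstanding requests across the whole GPU, is why a kernel needs many concurrent warps and memory transactions to come anywhere near peak bandwidth.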
Table 1 lists the bandwidth and latency of different storage types.
| Storage Type | Register | Shared Memory | Texture Memory | Constant Memory | Global Memory |
|---|---|---|---|---|---|
| Bandwidth | ~8 TB/s | ~1.5 TB/s | ~200 GB/s | ~200 GB/s | ~200 GB/s |
| Latency | 1 clock cycle | 1–32 clock cycles | 400–600 clock cycles | 400–600 clock cycles | 400–600 clock cycles |
The preceding performance data is based on the A100. For details about the performance of other GPUs, see https://sysrqmts.com/gpus/compare/nvidia-a10-pcie-vs-nvidia-a100-sxm4-40-gb.

