System Optimization Methods
Generally, GPU optimization involves two challenges: how to port existing CPU code to CUDA and how to accelerate existing CUDA code. The Assess, Parallelize, Optimize, Deploy (APOD) process was developed to address both. It calls for parallelizing only the single most critical performance bottleneck at a time. For example, if a legacy codebase or project has 10 slow-running spots, you first identify the one bottleneck that most affects performance and temporarily ignore the other nine. This lets you observe tangible optimization results after each iteration and avoids accumulating too many optimization tasks at once.
- Assess: For a project, the first step is to assess the code by measuring the execution time of each part. With this information, developers can identify the bottleneck to be parallelized and begin accelerating the code on the GPU. The upper limit of the optimized parallel performance can be estimated using Amdahl's Law and Gustafson's Law from the previous section.
- Parallelize: After identifying the bottleneck and setting optimization goals and expectations, the code can be parallelized. Calling GPU-accelerated libraries (such as cuBLAS, cuFFT, and Thrust) may be all that is needed. In other cases, developers need to refactor the code to expose the parts that can be parallelized.
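A minimal sketch of the kind of refactoring this step refers to, using a hypothetical CPU loop (not an example from the text): a loop that fuses per-element work with a running sum carries a sequential dependency, but splitting it into an independent map followed by a reduction exposes two stages that map directly onto library calls such as thrust::transform and thrust::reduce.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Before: one loop that both transforms each element and accumulates a sum.
// The running sum is a loop-carried dependency, so the loop cannot be handed
// to a parallel library as-is.
double fused_loop(const std::vector<double>& x) {
    std::vector<double> y(x.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = x[i] * x[i];  // independent per-element work
        sum += y[i];         // sequential dependency
    }
    return sum;
}

// After: an independent element-wise map (trivially parallel) followed by a
// reduction, each of which has a direct GPU-library counterpart.
double map_then_reduce(const std::vector<double>& x) {
    std::vector<double> y(x.size());
    std::transform(x.begin(), x.end(), y.begin(),
                   [](double v) { return v * v; });
    return std::accumulate(y.begin(), y.end(), 0.0);
}
```

Both functions return the same result; the refactored version simply restructures the work so that the parallelizable portion is visible to the compiler or library.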
- Optimize: After determining the parts to be parallelized, you need to consider the specific implementation. Generally, there is more than one way to implement an optimization, so a thorough understanding of the application's requirements is needed to choose among them. APOD is an iterative process: identify optimization points, implement and test the optimization, verify the results, and then repeat. Developers do not need to find strategies that solve all performance bottlenecks up front. Optimization can be performed at different levels, and performance analysis tools are helpful throughout.
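The implement-test-verify loop above can be sketched as a small measurement harness. The functions below are hypothetical placeholders: `sum_baseline` stands in for the existing code and `sum_candidate` for an optimized (e.g. GPU-accelerated) replacement.

```cpp
#include <chrono>
#include <numeric>
#include <vector>

// Hypothetical baseline implementation of one step of the application.
double sum_baseline(const std::vector<double>& x) {
    double s = 0.0;
    for (double v : x) s += v;
    return s;
}

// Hypothetical optimized candidate; in a real project this would be a GPU
// kernel or a library call rather than std::accumulate.
double sum_candidate(const std::vector<double>& x) {
    return std::accumulate(x.begin(), x.end(), 0.0);
}

// Average wall-clock time of f over a few iterations, in milliseconds.
template <typename F>
double time_ms(F&& f, int iters = 50) {
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count() / iters;
}
```

One APOD iteration then amounts to: check that `sum_candidate(x)` matches `sum_baseline(x)`, compare `time_ms` for both, and accept the candidate only if it is both correct and measurably faster, before moving on to the next bottleneck.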
- Deploy: The general principle is to deploy the optimized code to the production environment as soon as each optimization is complete, rather than first searching for further parts to optimize. There are good reasons for this: users benefit from each optimization quickly, and the iterative approach reduces risk and helps ensure stability in production.
Parent topic: GPU Optimization Methodology