
Performance Analysis Tool Usage

NVIDIA provides professional GPU performance analysis tools; Nsight Systems and Nsight Compute are the two most commonly used during performance optimization.

  • Nsight Systems provides system-level application analysis: it presents a holistic, intuitive timeline view covering the entire system, including the CPU, GPU, OS, runtime, and workload. This matters because application performance depends on many factors, not solely on the running efficiency of individual kernel functions.
  • Nsight Compute enables CUDA kernel-level analysis for individual kernel functions.

Figure 1 illustrates the performance analysis workflow. First, use Nsight Systems to obtain a system-level overview of the application, eliminate system-level bottlenecks (such as unnecessary thread synchronization or data movement), and improve the system-level parallelism of the algorithm. Then, use Nsight Compute to optimize the most important CUDA kernel. Feed each optimization result back into Nsight Systems periodically to confirm that the biggest performance bottleneck is still the one being worked on; otherwise the bottleneck may have shifted, and optimizing that kernel will no longer deliver the expected gains.

Figure 1 Performance analysis workflow

Using Nsight Systems

  1. Generate an analysis report.
    nsys profile -y 1 -d 100 app

    The parameters are described as follows:

    • -y: delay, in seconds, before collection starts.
    • -d: collection duration, in seconds.

    You can add --stats=true to the command to view the system overview after the collection is complete.
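The options above can be combined into a single guarded command. The following is a minimal sketch; the binary name app and the report name app-report are placeholders to adjust for your project:

```shell
# Sketch: one-shot Nsight Systems collection that also prints a summary.
# "app" is a placeholder for your application binary.
if command -v nsys >/dev/null 2>&1 && [ -x ./app ]; then
  # -y: collection start delay (s); -d: collection duration (s);
  # --stats=true: print a summary when done; -o: report file name.
  nsys profile -y 1 -d 100 --stats=true -o app-report ./app
else
  echo "nsys or ./app not found; skipping collection"
fi
```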

  2. Import the generated analysis report using Nsight Systems on Windows.

    The Nsight Systems version used to collect performance data on Linux must match the version used to view it on Windows.

  3. On the Timeline view tab page, view the CPU/GPU running time distribution to locate the bottleneck.

    As shown in the following figure, the kernel function k_eam_fast is the most time-consuming. Therefore, you can focus your analysis and optimization on the kernel code.

    For details, see https://docs.nvidia.com/nsight-systems/index.html.

Using Nsight Compute

  1. Collect a single microarchitecture-related metric.

    For example, collect the warp usage (achieved occupancy):

    ncu --metrics sm__warps_active.avg.pct_of_peak_sustained_active --kernel-name k_eam_fast app

    The parameters are described as follows:

    • --metrics: specifies the performance metric to be collected.
    • --kernel-name k_eam_fast: collects only the metric of the k_eam_fast kernel. If this parameter is not used, the metric of all kernels is collected.

    For details about GPU microarchitecture-related metrics, see https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html#nvprof-metric-comparison.
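Metric names can differ between GPU architectures and tool versions; one way to discover them locally, assuming the ncu CLI is on your PATH, is to query the tool itself:

```shell
# Sketch: list the metrics this Nsight Compute version supports and
# filter for warp-activity counters. No application run is required.
if command -v ncu >/dev/null 2>&1; then
  ncu --query-metrics | grep warps_active || true
else
  echo "ncu not found on PATH; skipping metric query"
fi
```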

  2. Collect all metrics.
    ncu --set full --import-source=yes --launch-skip 1000 --launch-count 1 --kernel-name k_eam_fast --export ncu-log app

    An analysis report is generated after the command is executed.

    The parameters are described as follows:

    • --set full: collects all metrics. You can run the ncu --list-sets command to list all performance analysis configuration sets supported by Nsight Compute.
    • --import-source=yes: embeds the CUDA source code in the report. This requires that the application be compiled with the nvcc -lineinfo option.
    • --launch-skip 1000: skips the first 1,000 launches of the kernel before profiling begins, so that the initialization/warm-up phase is not measured.
    • --launch-count 1: profiles only one kernel launch after the skipped ones.
    • --export ncu-log: specifies the name of the output report.
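Putting the -lineinfo prerequisite and the collection command together, a minimal build-and-profile sequence might look as follows. The file name app.cu is a placeholder; the kernel name k_eam_fast is taken from the surrounding example:

```shell
# Sketch: build with -lineinfo so that --import-source=yes can map
# collected metrics back to CUDA source lines, then profile a single
# post-warm-up launch of k_eam_fast.
if command -v nvcc >/dev/null 2>&1 && command -v ncu >/dev/null 2>&1 \
    && [ -f app.cu ]; then
  nvcc -O3 -lineinfo -o app app.cu
  ncu --set full --import-source=yes --launch-skip 1000 --launch-count 1 \
      --kernel-name k_eam_fast --export ncu-log ./app
else
  echo "nvcc, ncu, or app.cu not found; skipping"
fi
```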
  3. Import the generated analysis report using Nsight Compute on Windows.
  4. Switch to the Details tab page and view the following information.

    • GPU Speed Of Light Throughput: overall utilization of the GPU's compute and memory resources.
    • Compute Workload Analysis: utilization of SM compute resources, including instructions per cycle (IPC).
    • Memory Workload Analysis: utilization of each level of the memory hierarchy.
    • Scheduler Statistics: instruction issue statistics of the warp schedulers.
    • Warp State Statistics: warp states during kernel execution.
    • Instruction Statistics: composition and execution of Streaming Assembler (SASS) instructions.
    • Launch Statistics: resource configuration (grid/block/thread/warp/register/shared memory) at kernel launch.
    • Occupancy: warp usage (achieved occupancy).

  5. Switch to the Source tab page to view the complete CUDA source code, the associated SASS and PTX assembly code, and the metrics mapped to each line of code.

    You can then locate the code to be optimized.

    For details, see https://docs.nvidia.com/nsight-compute/index.html.

Performance Analysis Case

Analyze the performance bottlenecks of the Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) application.

  1. Import the generated analysis report to Nsight Compute.

    As shown in the following figure, the throughput of the compute resources is low.

  2. Check Compute Workload Analysis and Warp State Statistics. The following figure shows the proportion of each warp stall root cause.

    The analysis result shows that Stall Long Scoreboard is the main cause of warp stalls.

  3. Switch to the Source tab page to view the code that causes Stall Long Scoreboard and optimize the execution mode of the code.

    Stall Long Scoreboard means that the warp's next instruction is waiting on a scoreboard dependency for data from an L1TEX operation (local, global, surface, or texture memory).

    Solution: Increase data locality to improve the cache hit rate, or move frequently used data into shared memory.

    For details about Warp Scheduler State, see https://docs.nvidia.com/nsight-compute/2022.2/ProfilingGuide/index.html#statistical-sampler.
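To verify that an optimization actually reduced this stall, you can re-collect just the corresponding stall metric before and after the change. The metric name below is an assumption and may vary between Nsight Compute versions; confirm it first with ncu --query-metrics:

```shell
# Sketch: re-measure only the long-scoreboard stall ratio for k_eam_fast.
# NOTE: the metric name below is an assumption and may differ between
# Nsight Compute versions; confirm it first with:
#   ncu --query-metrics | grep long_scoreboard
if command -v ncu >/dev/null 2>&1 && [ -x ./app ]; then
  ncu --metrics smsp__warp_issue_stalled_long_scoreboard_per_warp_active.pct \
      --kernel-name k_eam_fast ./app
else
  echo "ncu or ./app not found; skipping"
fi
```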