QE Performance Optimization
Quantum ESPRESSO (QE) is based on first-principles density functional theory (DFT). It primarily includes two modules: Plane-Wave Self-Consistent Field (PWscf) and Car–Parrinello (CP) molecular dynamics.
- Advantages
QE can calculate the Fermi surfaces of metals, electron-phonon coupling, and superconducting properties (both isotropic and anisotropic). Its modular design makes it easy to add new functionality. QE is open source, free to use, and user-friendly, shipping with a dedicated post-processing package (PP).
- Disadvantages
Many types of pseudopotentials exist for the same atom, so it can be difficult to find appropriate pseudopotentials for every element in a multi-component compound. Additionally, vc-relax (variable-cell relaxation) calculations are extremely slow.
For more information about QE, visit http://www.quantum-espresso.org/.
Since 2016, NVIDIA and the QE developers have jointly developed a GPU-accelerated version of QE written in CUDA Fortran. This version uses a hybrid approach: CUDA Fortran for GPU computation and MPI + OpenMP for CPU parallelism. At the time of writing, the latest release is version 7.0; QE-GPU is considered stable and receives regular updates. Below is a list of modules that support GPU acceleration.


The most widely used QE package is PWscf. It solves the Kohn-Sham (K-S) equations in a plane-wave basis, iterating until a self-consistent field is reached. This is an iterative method for solving multiple interdependent equations. From a simplified perspective, the process solves the first equation, then the second, then the third, and then uses the solution of the third equation to solve the first one again, this time more accurately. Initially the results may fluctuate significantly, but by repeating this cycle they gradually converge; this is called self-consistency. Each iteration involves multiple steps, which typically require extensive linear algebra operations and FFTs. The following figure shows the algorithm procedure.

The main computation patterns of the steps with green blocks are as follows:
A. 3D-FFT + matrix operation + LAPACK operation
B. 3D-FFT + matrix operation
C. 3D-FFT
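As an illustration only (not QE code), the self-consistent cycle above can be sketched as a fixed-point iteration: build a Hamiltonian from the current density, diagonalize it, recompute the density, and repeat until the change falls below a tolerance. The toy Hamiltonian and linear mixing below are stand-ins for the real Kohn-Sham operator and QE's density mixing.

```python
import numpy as np

def scf_iteration(n_basis=8, tol=1e-8, max_iter=100, mixing=0.5):
    """Toy self-consistent field loop: iterate density -> Hamiltonian ->
    eigenstates -> new density until the density stops changing."""
    rng = np.random.default_rng(0)
    # Fixed one-electron part (symmetric); stands in for kinetic + ionic terms.
    h0 = rng.standard_normal((n_basis, n_basis))
    h0 = 0.5 * (h0 + h0.T)
    density = np.full(n_basis, 1.0 / n_basis)   # initial guess
    for it in range(max_iter):
        # "Potential" built from the current density (toy mean-field term).
        h = h0 + np.diag(density)
        # In QE this step is iterative diagonalization using 3D-FFTs + LAPACK.
        _, vecs = np.linalg.eigh(h)
        new_density = vecs[:, 0] ** 2            # occupy the lowest state
        if np.linalg.norm(new_density - density) < tol:
            return it + 1, new_density
        # Linear mixing damps oscillations between iterations.
        density = (1 - mixing) * density + mixing * new_density
    return max_iter, density

iters, rho = scf_iteration()
print(iters, rho.sum())
```

The mixing parameter plays the same stabilizing role as QE's `mixing_beta`: without damping, successive densities can oscillate instead of converging.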
The QE-GPU version offers two main parallelization options. They can be selected at runtime and provide significant performance gains. The following describes the parallel strategies used in solving the quantum mechanics (QM) equations. The code computes the wavefunction (ψ) by solving a set of independent sub-problems called k-points; for each k-point and each band, there is a set of plane-wave (PW) coefficients. QE can divide these coefficients across MPI processes, which reduces per-process memory usage but causes heavy inter-process communication. When sufficient memory is available, the preferred strategy is therefore k-point pooling. The k-point level is embarrassingly parallel: for example, each GPU can be assigned a k-point, and no communication is needed between these GPUs during the solve.
The following uses a ZnO SCF calculation as an example to describe how to test the software stack.
| Software Stack | x86 6348 (2P), 2 × A100 | Kunpeng 920 7265 (2P), 2 × A100 |
|---|---|---|
| OS | Kylin V10 | Kylin V10 |
| Memory | 16 × 16 GB | 16 × 16 GB |
| MPI | NVIDIA GPU OpenMPI 4 | NVIDIA GPU OpenMPI 4 |
| Compiler | Intel 2021 | Kunpeng GCC 9.3.1 |
The following figure shows the distribution of hotspot functions collected on Kunpeng. The GPU share of the runtime is very high (more than 80%), so the workload is a good candidate for GPU-side optimization.

Use Nsight Systems (nsys) to collect GPU code performance data. The result is shown in the following figure.

According to the hotspot analysis, GPU kernels account for 50% to 70% of the total hotspot time. However, the GPU kernel hotspots are scattered, and the most time-consuming kernels show low GPU utilization and small grid sizes.
Run the nvidia-smi command to check GPU utilization. The utilization of the two GPUs is not high, indicating substantial room for optimization.
GPU Compilation Option Optimization
QE is mainly implemented in Fortran. Tests based on the compilation options in GPU Compilation Parameter Optimization show that the following options can slightly improve performance.
| Compilation Option | Description |
|---|---|
| -O0 to -O4 | Code optimization level. Use -O4. |
| -Mipa | Enables interprocedural analysis and optimization. |
| -Munroll | Controls loop unrolling. |
| -Mvect | Enables automatic vectorization. |
| -use_fast_math | Enables fast math: vectorization, cache alignment, and flush-to-zero (flushz). |
| --fma | Controls whether fused multiply-add (FMA) instructions are generated. Set --fma=true. |
MPS Optimization
Performance analysis shows that GPU utilization is relatively low and GPU power draw does not reach its maximum, so the CUDA Multi-Process Service (MPS) is used. MPS enables compute kernels submitted from multiple CPU processes to execute concurrently on the same GPU; this overlap can make fuller use of GPU resources and improve overall throughput. In addition, tests show that load balancing for the K-S equations is best with eight processes.

Cross-GPU Optimization
By default, the two GPUs share one CPU socket, which causes PCIe bus contention and unnecessary resource conflicts. Therefore, bind each GPU to a separate CPU, as shown below:
```
mpirun --allow-run-as-root -np 8 -x CUDA_VISIBLE_DEVICES=0,2 -x OMP_NUM_THREADS=1 pw.x -nk 8 -input scf.in
```
Kernel Code Optimization
For the hotspot kernel vexx_k_gpu, performance analysis shows thread divergence and too few thread blocks. To address these issues, move the loop-invariant conditional outside the loop and appropriately increase the number of blocks, as shown below:
Before optimization:
```fortran
all_start_tmp = all_start(wegrp)
DO jbnd = jstart, jend
   !$cuf kernel do (1)
   DO ir = 1, nrxxs
      IF (noncolin) THEN
         result_nc_d(ir,1,ii) = result_nc_d(ir,1,ii) &
             + vc(ir,jbnd-jstart+1) * exxbuff(ir,jbnd-all_start_tmp+iexx_start,ikq)
         result_nc_d(ir,2,ii) = result_nc_d(ir,2,ii) &
             + vc(ir,jbnd-jstart+1) * exxbuff(ir+nrxxs,jbnd-all_start_tmp+iexx_start,ikq)
      ELSE
         result_d(ir,ii) = result_d(ir,ii) &
             + vc(ir,jbnd-jstart+1)*exxbuff(ir,jbnd-all_start_tmp+iexx_start,ikq)
      ENDIF
   ENDDO
ENDDO
```
After optimization:
```fortran
all_start_tmp = all_start(wegrp)
IF (noncolin) THEN
   DO jbnd = jstart, jend
      !$cuf kernel do <<<16,*,stream=cudaGetStreamDefault()>>>
      DO ir = 1, nrxxs
         result_nc_d(ir,1,ii) = result_nc_d(ir,1,ii) &
             + vc(ir,jbnd-jstart+1) * exxbuff(ir,jbnd-all_start_tmp+iexx_start,ikq)
         result_nc_d(ir,2,ii) = result_nc_d(ir,2,ii) &
             + vc(ir,jbnd-jstart+1) * exxbuff(ir+nrxxs,jbnd-all_start_tmp+iexx_start,ikq)
      ENDDO
   ENDDO
ELSE
   DO jbnd = jstart, jend
      !$cuf kernel do <<<16,*,stream=cudaGetStreamDefault()>>>
      DO ir = 1, nrxxs
         result_d(ir,ii) = result_d(ir,ii) &
             + vc(ir,jbnd-jstart+1)*exxbuff(ir,jbnd-all_start_tmp+iexx_start,ikq)
      ENDDO
   ENDDO
ENDIF
```
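The correctness of this transformation can be checked in miniature outside CUDA Fortran: hoisting a loop-invariant branch out of the inner loop removes per-iteration divergence without changing any results. A small Python/NumPy sketch (hypothetical function names, not QE code):

```python
import numpy as np

def accumulate_branch_inside(result, vc, buf, noncolin):
    """Branch evaluated on every iteration (mirrors the original kernel)."""
    out = result.copy()
    for ir in range(len(out)):
        if noncolin:                      # loop-invariant test, re-checked each pass
            out[ir] += 2.0 * vc[ir] * buf[ir]
        else:
            out[ir] += vc[ir] * buf[ir]
    return out

def accumulate_branch_hoisted(result, vc, buf, noncolin):
    """Loop-invariant branch hoisted out; each loop body is now uniform."""
    out = result.copy()
    if noncolin:
        for ir in range(len(out)):
            out[ir] += 2.0 * vc[ir] * buf[ir]
    else:
        for ir in range(len(out)):
            out[ir] += vc[ir] * buf[ir]
    return out

rng = np.random.default_rng(1)
r, v, b = (rng.standard_normal(64) for _ in range(3))
for flag in (True, False):
    assert np.allclose(accumulate_branch_inside(r, v, b, flag),
                       accumulate_branch_hoisted(r, v, b, flag))
print("results identical")
```

On a CPU the compiler can often perform this hoisting itself; on a GPU, where all threads in a warp execute the same instruction stream, removing the in-loop branch directly eliminates divergence.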
After the preceding optimizations, performance on the Kunpeng platform improves severalfold, making it comparable to that of the next-generation C1 chip.
