LAMMPS Performance Optimization
LAMMPS ships with a variety of interatomic potential files and can be used to simulate soft matter (biomolecules and polymers), solid-state materials (metals and semiconductors), and coarse-grained and mesoscopic systems. It can perform molecular simulation at the atomic, mesoscopic, or continuum scale. LAMMPS has powerful parallel processing capabilities and can run large-scale molecular dynamics simulations on tens of thousands of CPU cores.
LAMMPS integrates Newton's equations of motion for all atoms and molecules in the system that interact with each other. To improve efficiency, LAMMPS uses neighbor lists to keep track of nearby particles. These lists are optimized for systems with short-range repulsive forces to prevent local particle densities from becoming too high. On parallel computers, LAMMPS employs a spatial decomposition technique that divides the simulation domain into smaller regions, each assigned to a different processor.
LAMMPS Algorithm Analysis
The LAMMPS calculation can be divided into the following four steps:
- Find an appropriate interatomic potential (force field) file, or download an existing potential file.
- Edit the input file.
- Run the calculation, which can be performed on a large-scale GPU cluster.
- Post-process the calculation results.
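For step 2, the input file of a simple metal (EAM copper) case might look like the following sketch (lattice constant, temperature, and potential file name are illustrative, modeled on the `bench/in.eam` benchmark shipped with LAMMPS):

```
# Minimal EAM copper input sketch (values illustrative)
units        metal
atom_style   atomic
lattice      fcc 3.615
region       box block 0 20 0 20 0 20
create_box   1 box
create_atoms 1 box
pair_style   eam
pair_coeff   1 1 Cu_u3.eam
mass         1 63.55
velocity     all create 1600.0 376847 loop geom
neighbor     1.0 bin
neigh_modify every 1 delay 5 check yes
fix          1 all nve
timestep     0.005
run          100
```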

The following uses the metal case as an example. The test server is equipped with two Kunpeng 920 7260 processors and 2 NVIDIA A100 GPUs. The software stack is as follows:
| Item | Value |
|---|---|
| OS | Kylin V10 |
| Memory | 16 × 16 GB |
| MPI | Hyper MPI |
| Compiler | BiSheng 2.4.0 |
MPS Optimization
Performance analysis shows that GPU utilization is relatively low and GPU power consumption has not reached its maximum, so the CUDA Multi-Process Service (MPS) is used. MPS is a facility that enables compute kernels submitted from multiple CPU processes to execute concurrently on the same GPU. Such overlapping can enable more thorough resource use and better overall throughput, and MPS can also help applications scale across multiple GPUs through more efficient overlap of hardware resource utilization and better exploitation of CPU-based parallelism. In addition, the test results show that load balancing is optimal when the number of processes is 30.
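Enabling MPS for such a run can be sketched as follows (the pipe/log directories, the `lmp` binary name, and the input file name are assumptions; the 30-rank count follows the test above):

```
# Start the CUDA MPS control daemon before launching the MPI job
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
nvidia-cuda-mps-control -d

# 30 MPI ranks share the GPUs through MPS
mpirun -np 30 lmp -sf gpu -in in.eam

# Stop the daemon when done
echo quit | nvidia-cuda-mps-control
```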

GPU Compilation Option Optimization
| Compilation Option | Description |
|---|---|
| `-O0` to `-O4` | Code optimization level. Use `-O4`. |
| `-use_fast_math` | Enables fast-math optimizations such as vectorization, cache alignment, and flush-to-zero (ftz). |
| `--fmad` | Controls whether fused multiply-add (FMA) contraction is enabled. Set `--fmad=true`. |
Cross-GPU Optimization
By default, the two GPUs share one CPU, which causes PCIe bus contention and unnecessary resource conflicts. Therefore, it is necessary to bind each GPU to a separate CPU, as shown in the following figure.
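One common way to implement this binding is a per-rank wrapper script (the script name, the NUMA node numbering, and the `lmp` command are assumptions; `OMPI_COMM_WORLD_LOCAL_RANK` is set by Open MPI-based launchers such as Hyper MPI):

```
#!/bin/bash
# bind_gpu.sh -- hypothetical wrapper: even local ranks use GPU 0 with CPU
# socket 0, odd local ranks use GPU 1 with CPU socket 1.
rank=${OMPI_COMM_WORLD_LOCAL_RANK:-0}
if [ $((rank % 2)) -eq 0 ]; then
    export CUDA_VISIBLE_DEVICES=0
    exec numactl --cpunodebind=0 --membind=0 "$@"
else
    export CUDA_VISIBLE_DEVICES=1
    exec numactl --cpunodebind=1 --membind=1 "$@"
fi
```

It would be invoked as, for example, `mpirun -np 30 ./bind_gpu.sh lmp -sf gpu -in in.eam`. On a multi-NUMA-node Kunpeng 920, the node numbers passed to `numactl` should match the nodes physically closest to each GPU.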

Kernel Code Optimization
Performance analysis of the hotspot kernel (k_eam_fast) shows data-dependency stalls (Stall Long Scoreboard) and repeated calculations.
- Nsight Compute shows that a data dependency exists in lines 535 and 536. Move the variables to shared memory for processing.
Before optimization:

```cpp
for ( ; nbor < nbor_end; nbor += n_stride) {
  int j = dev_packed[nbor];
  j &= NEIGHMASK;
```

After optimization:

```cpp
for ( ; nbor < nbor_end; nbor += n_stride) {
  __shared__ int j;
  j = dev_packed[nbor];
  j &= NEIGHMASK;
```

- Reduce repeated calculations.

Before optimization:

```cpp
f.x += delx*force; f.y += dely*force; f.z += delz*force;
if (EVFLAG && eflag) { energy += phi; }
if (EVFLAG && vflag) {
  virial[0] += delx*delx*force; virial[1] += dely*dely*force; virial[2] += delz*delz*force;
  virial[3] += delx*dely*force; virial[4] += delx*delz*force; virial[5] += dely*delz*force;
}
```

After optimization:

```cpp
acctyp4 t1;
t1.x = delx*force; t1.y = dely*force; t1.z = delz*force;
f.x += t1.x; f.y += t1.y; f.z += t1.z;
if (EVFLAG && eflag) { energy += phi; }
if (EVFLAG && vflag) {
  virial[0] += delx*t1.x; virial[1] += dely*t1.y; virial[2] += delz*t1.z;
  virial[3] += dely*t1.x; virial[4] += delx*t1.z; virial[5] += delz*t1.y;
}
```

After the preceding optimizations, performance on the Kunpeng platform is significantly improved.

