Computation Offloading to GPUs via OpenMP Directives

OpenMP, maintained by the OpenMP Architecture Review Board, is a widely adopted set of compiler directives for parallel programming on shared-memory systems. It did not support offloading computations to GPUs until OpenMP 4.0 (released in 2013), which added compiler directives that instruct the compiler and runtime to offload code blocks to devices, including GPUs.
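Whether an offloaded block actually runs on a GPU depends on the compiler and on the devices visible at run time. As a minimal sketch, the following helpers check this with the standard runtime routines omp_get_num_devices() and omp_is_initial_device(). The names count_offload_devices and target_region_ran_on_host are hypothetical and not part of the original example. The sketch assumes that the compiler defines _OPENMP as 201307 or later (OpenMP 4.0) when OpenMP is enabled; otherwise it falls back to host-only behavior.

```c
#if defined(_OPENMP) && _OPENMP >= 201307
#include <omp.h>
#define HAVE_OMP_TARGET 1
#else
#define HAVE_OMP_TARGET 0
#endif

/* Hypothetical helper: number of non-host devices the runtime can offload to.
   Returns 0 when the code is built without OpenMP 4.0+ support. */
int count_offload_devices(void)
{
#if HAVE_OMP_TARGET
    return omp_get_num_devices();
#else
    return 0;
#endif
}

/* Hypothetical helper: returns 1 if a target region executes on the host
   (no device available, or offloading disabled), 0 if it runs on a device. */
int target_region_ran_on_host(void)
{
    int on_host = 1;
#if HAVE_OMP_TARGET
    /* on_host is written inside the region, so it is copied back with map(from:). */
    #pragma omp target map(from: on_host)
    {
        on_host = omp_is_initial_device() ? 1 : 0;
    }
#endif
    return on_host;
}
```

If target_region_ran_on_host() returns 1 on a machine that has a GPU, the compiler was probably built or invoked without offloading support for that device.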

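OpenMP uses #pragma omp target to offload a code block to the device. The map clauses on the directive specify which data is transferred between the host and the GPU. For example: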

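void kunpeng_test()
{
    int len = 16;
    int x[len], y[len], z[len];
    /* Initialize the input and output arrays on the host. */
    for(int i = 0; i < len; i++)
    {
        x[i] = i;
        y[i] = i * 2;
        z[i] = 0;
    }
    /* Offload the loop: x, y, and len are copied to the device; z is copied in and back out. */
    #pragma omp target map(to: x[0:len], y[0:len], len) map(tofrom: z[0:len])
    for(int index = 0; index < len; index++)
    {
        z[index] = x[index] * y[index];
    }
}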

In the map clause, to indicates that data is copied from the host to the device when the target region starts, from indicates that data is copied from the device back to the host when the region ends, and tofrom does both.
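The three map types can be seen side by side in the following sketch. The function map_clause_demo and its parameters are hypothetical names chosen for illustration. The input array is only read on the device, the output array is only written, and the accumulator is both read and updated. If the code is compiled without OpenMP, the pragma is ignored and the loop simply runs on the host.

```c
/* Hypothetical helper: out[i] = in[i] * factor and acc[i] += in[i]. */
void map_clause_demo(const int *in, int *out, int *acc, int n, int factor)
{
    /* in:  only read on the device     -> map(to:)      host to device
       out: only written on the device  -> map(from:)    device to host
       acc: read and updated            -> map(tofrom:)  both directions */
    #pragma omp target map(to: in[0:n]) map(from: out[0:n]) map(tofrom: acc[0:n])
    for (int i = 0; i < n; i++)
    {
        out[i] = in[i] * factor;
        acc[i] += in[i];
    }
}
```

Choosing the narrowest map type that fits avoids unnecessary copies: mapping out with from skips the host-to-device transfer that tofrom would perform.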

Assume that the host is a Kunpeng server and the device is a GPU. Figure 1 shows the data transfer process of the preceding code.

Figure 1 Data transfer

OpenMP uses separate directives to offload work to a device (target) and to parallelize work on that device (teams, distribute, parallel for, and simd). For convenience, OpenMP also defines combined directives that merge them into one. The following code combines offloading and parallel execution in a single directive.

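void kunpeng_test()
{
    int len = 16;
    int num_blocks = 4; /* number of teams; example value only */
    int x[len], y[len], z[len];
    /* Initialize the input and output arrays on the host. */
    for(int i = 0; i < len; i++)
    {
        x[i] = i;
        y[i] = i * 2;
        z[i] = 0;
    }
    /* Offload the loop and distribute its iterations across teams, threads, and SIMD lanes. */
    #pragma omp target teams distribute parallel for simd num_teams(num_blocks) map(to: x[0:len], y[0:len], len) map(tofrom: z[0:len])
    for(int index = 0; index < len; index++)
    {
        z[index] = x[index] * y[index];
    }
}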
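For comparison, the combined directive in the preceding code can be written as nested constructs, which shows the role of each part. This is a sketch only: the function multiply_split and its parameter teams are hypothetical names, and the arrays are passed in as pointers rather than declared locally. The target construct offloads the block and maps the data, teams creates a league of thread teams on the device, and distribute parallel for simd spreads the loop iterations across teams, threads, and SIMD lanes.

```c
/* Hypothetical helper: z[i] = x[i] * y[i], using nested constructs
   instead of a single combined directive. */
void multiply_split(const int *x, const int *y, int *z, int len, int teams)
{
    /* 1. Offload the block and describe the data movement. */
    #pragma omp target map(to: x[0:len], y[0:len]) map(tofrom: z[0:len])
    {
        /* 2. Create up to `teams` thread teams on the device. */
        #pragma omp teams num_teams(teams)
        {
            /* 3. Split iterations across teams, then threads, then SIMD lanes. */
            #pragma omp distribute parallel for simd
            for (int index = 0; index < len; index++)
            {
                z[index] = x[index] * y[index];
            }
        }
    }
}
```

The nested form is mainly useful when extra clauses or code need to be attached to one specific level; for a single loop like this one, the combined directive is shorter.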

The LAMMPS GPU version uses OpenMP to offload tasks to GPUs.
