OpenMP Parallelization
Principles
OpenMP is an application programming interface (API) for platform-independent shared-memory parallel programming in C, C++, and Fortran. As an API, it provides a high-level abstraction layer over low-level multithreading primitives. Its programming model and interfaces are portable across compilers and hardware architectures and scale to any number of CPU cores. It is therefore applicable to a wide range of scenarios, from desktop PCs with a few CPU cores to supercomputer compute nodes with hundreds of cores. Modern compilers (GCC 4.4+/Clang 3.8+) support OpenMP out of the box, so no additional dependency libraries need to be installed.
OpenMP uses special, comment-like compiler directives (called pragmas) to annotate sequential code and give the compiler hints about how to parallelize it. Simply adding the appropriate pragma is often enough to parallelize existing sequential code.
Parallelization requires the following conditions to hold; otherwise, the algorithm needs to be redesigned first:
- The processing steps must not depend on each other. Parallelization means that different data is processed at the same time; if the data processed at one point in time depends on an earlier processing result, the work cannot be parallelized.
- There must be no race condition in the parallelized region. OpenMP is a framework for shared-memory architectures, where every thread can access any variable or array declared outside the parallelized region. If multiple threads modify the same data at the same time, the result is nondeterministic. Locks can prevent this, but they degrade performance, which defeats the purpose of parallelization.
OpenMP uses the fork-join execution model. At the beginning, there is only one main thread. When parallel computation is required, multiple branch threads are forked to execute the parallel task. When the parallel code finishes, the branch threads join and control returns to the single main thread. Figure 1 shows a typical fork-join execution model.
Modification Method
OpenMP is most commonly used to parallelize for loops. The following uses matrix multiplication as an example.
- Include the OpenMP header file: #include <omp.h>.
- Add a pragma before the for loop that should be executed in parallel. The following figure shows an OpenMP parallelization case.

By default, variables declared before the parallelized region are shared, and variables defined inside the region are private. The loop variable of the loop immediately following the pragma is also private. Check whether any other variable in the region needs to be made private to avoid contention.
In this example, the variables i, j, k, and sum are private, and the arrays a, b, and ans are shared without contention.
- Compile the program. The -fopenmp compiler flag enables OpenMP support. The compile command is as follows:
g++ -O2 -std=c++11 -fopenmp matrix_multiplication.cpp -o matrix_multiplication
- Execute the program. At runtime, the default number of threads typically equals the number of logical CPU cores detected by the OS. Use the OMP_NUM_THREADS environment variable to change it. (You can also call omp_set_num_threads() or use the num_threads clause to specify the number of threads in the source code.) The command is as follows:
OMP_NUM_THREADS=2 ./matrix_multiplication
In this example, the test was performed on Kunpeng processors. With a matrix size of 3000, the serial execution time is 56s. With two OpenMP threads running in parallel, that is, iterations 0 to 1499 in one thread and iterations 1500 to 2999 in the other, the execution time drops to 38s, a significant improvement.
