Compiler Optimization Methods
Compiler optimization options perform code optimization at compile time. The following sections list common optimization options and the principles behind them; in some scenarios they can greatly improve program performance.
Instruction Sets and Pipelines
During C/C++ compilation, the compiler translates the source code into a sequence of instructions the CPU can execute and writes that sequence into the executable binary. The CPU usually executes instructions in a pipeline to improve performance, so the order of instructions greatly affects pipeline efficiency. Good instruction scheduling generally has to account for factors such as the hardware resources available for instruction execution, the latencies of different instructions, and the data dependencies between instructions. By telling the compiler the instruction set and pipeline of the target platform (CPU), you can obtain better instruction scheduling.
GCC 9.1.0 supports the Armv8 instruction set and pipeline that are compatible with the Kunpeng processor.
Usage:
Add the following compilation options to CFLAGS and CPPFLAGS when using GCC for openEuler, BiSheng Compiler, or GCC 9.1.0 or later:
-mtune=tsv110 -march=armv8-a
Optimization Levels
The compiler uses -O to control program optimization levels, as shown in Table 1.
| Level | Description |
|---|---|
| -O0 | The default level. No optimization options are enabled. Compilation is fastest and the program is debugging-friendly. |
| -O1 | Enables common optimization options. |
| -O2 | Enables more optimization options than -O1. |
| -O3 | The highest optimization level. Compilation takes longer, but the generated program runs faster. |
| -Ofast | Enables the same optimizations as -O3 plus some optimizations that are not standards-compliant. |
| -Os | Enables the same optimizations as -O2 plus options that reduce code size. |
| -Og | Optimizes the debugging experience: enables optimizations that do not interfere with debugging. |
Run gcc -Q --help=optimizers -O2 to view the optimization options GCC enables at a given level (replace -O2 with the level of interest).

The compiler provides many optimization options. The following are examples of optimizations the compiler can perform.
- Function inlining:
You can use the -finline-functions compilation option to enable function inlining. It is enabled by default at the -O2, -O3, and -Os optimization levels. Function inlining eliminates call and return overhead and improves the instruction cache hit ratio; for very small functions it can even reduce code size, and it exposes opportunities for further optimizations.
Before:

```c
int funA (int a) { return a * a; }
int funB (int b) { return funA(b) + 1; }
```

After:

```c
int funB (int b) { return b * b + 1; }
```

- Constant propagation and folding:
This optimization is controlled by the compilation options -fdevirtualize, -fipa-cp, -fipa-cp-clone, -fipa-bit-cp, -fipa-vrp, -ftree-bit-ccp, -ftree-ccp, -ftree-dominator-opts, and -ftree-vrp, which are enabled by default at the -O2, -O3, and -Os levels. Expressions whose values can be computed at compile time are replaced with constants, reducing computation at run time.
Before:

```c
int funA (int a) { return a * a; }
int funB () { int a = funA(2); int b = a + 1; return b; }
```

After:

```c
int funB () { return 5; }
```

- Common subexpression elimination:
This optimization is controlled by the compilation options -fgcse, -fgcse-lm, -fgcse-sm, -fgcse-las, and -fgcse-after-reload, and is enabled by default at the -O2, -O3, and -Os levels. Expressions that are computed more than once are computed a single time and reused, reducing computation at run time.
Before:

```c
int a, b, c;
b = (a + 1) * (a + 1);
c = (a + 1) / 2;
```

After:

```c
int a, b, c, tmp;
tmp = a + 1;
b = tmp * tmp;
c = tmp / 2;
```
- Loop unrolling:
You can use the -funroll-loops and -funroll-all-loops compilation options to enable this optimization. When the loop trip count is small, the compiler replicates the loop body several times to eliminate loop control overhead, facilitate data prefetching, and improve the cache hit ratio.
Before:

```c
int a[4];
for (int i = 0; i < 4; ++i) { a[i] = i; }
```

After:

```c
int a[4];
a[0] = 0; a[1] = 1; a[2] = 2; a[3] = 3;
```
- Hoisting loop-invariant code:
This optimization is controlled by the -fmove-loop-invariants compilation option and is enabled by default at -O1 and above. It moves computations whose results do not change between iterations out of the loop body, eliminating repeated calculation.
Before:

```c
int a[100];
void funA(int b) { for (int i = 0; i < 100; ++i) { a[i] = b * b + i; } }
```

After:

```c
int a[100];
void funA(int b) {
    int tmp = b * b;
    for (int i = 0; i < 100; ++i) { a[i] = tmp + i; }
}
```

- Induction variable optimization:
Use the -fivopts compilation option to enable this optimization; it is enabled by default. When a variable in a loop changes by a fixed amount on each iteration, multiplication by the loop counter can be replaced with addition (strength reduction), accelerating the calculation.
Before:

```c
int a[100];
for (int i = 0; i < 100; ++i) { a[i] = i * 9 + 3; }
```

After:

```c
int a[100];
int tmp = 3;
for (int i = 0; i < 100; ++i) { a[i] = tmp; tmp += 9; }
```

- Devirtualization (calling C++ virtual functions directly instead of through the virtual table):
If the compiler can determine which virtual function will be called, it can call the function directly, avoiding the virtual table lookup and reducing call overhead. Use the -fdevirtualize compilation option to enable this optimization; it is enabled by default at the -O2, -O3, and -Os levels. The sample code is as follows:
```cpp
class C0 { public: virtual void funA (); };
class C1 : public C0 { public: virtual void funA (); };
int main () {
    C1 c1;
    C0* c = &c1;
    c->funA();
    return 0;
}
```

Before the optimization, the call goes through the virtual table. After the optimization, because the compiler can prove that c points to a C1 object, C1::funA is called directly.
- Vectorization:
During compilation, loops are vectorized so that NEON instructions are used automatically. NEON is a SIMD (single instruction, multiple data) technology that operates on multiple data elements with a single instruction. GCC enables the -ftree-vectorize option automatically at -O3; at -O1 and -O2 you must add -ftree-vectorize explicitly. At -O0, vectorization is not performed even if -ftree-vectorize is specified.
Reference code:
```c
int a[64 * 4];
int b[64 * 4];
int c[64 * 4];
int main () {
    for (int i = 0; i < 64 * 4; i++) {
        c[i] = a[i] + b[i];
    }
    return 0;
}
```
PGO
Profile-guided optimization (PGO) makes optimization decisions based on information (a profile) collected while the program runs. PGO requires two compilations and one run. In the first compilation, the compiler instruments the program with functions or instructions that record runtime behavior. The program is then run, and the recorded behavior is saved as a profile. In the second compilation, the compiler reads the saved profile and uses it to guide its optimization decisions, producing the final optimized program for performance testing.
PGO supports two feedback-based optimization workflows. The first uses compiler instrumentation: compile with instrumentation, run, then recompile with the feedback. The second needs no instrumentation: the perf tool runs the program and collects the profile used to guide compilation. The following describes the instrumentation-based PGO workflow. For details about perf-based PGO, see the official documentation:
- GCC: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#Optimize-Options
- Clang: https://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization
Using GCC
- Add the -fprofile-generate compilation option to instrument the program so that it generates profile information when run.
gcc -O2 -fprofile-generate vec.cpp
- Run the program to generate the profile information, that is, the gcda file.
./a.out
- Add the -fprofile-use compilation option and use the profile information to recompile the program.
gcc -O2 -fprofile-use vec.cpp
Using Clang
- Add the -fprofile-instr-generate compilation option to instrument the program so that it generates profile information when run.
clang -O2 -fprofile-instr-generate vec.cpp
- Run the program to generate the profile information. The default file name is default.profraw.
./a.out
- Use the llvm-profdata tool to convert the default.profraw file to a profile that can be identified by Clang.
llvm-profdata merge -output=code.profdata default.profraw
- Add the -fprofile-instr-use compilation option, specify the profile, and recompile the program.
clang -O2 -fprofile-instr-use=code.profdata vec.cpp
PGO items:
- Register allocation: Non-profile-guided compilation generally uses a static heuristic register allocation algorithm to keep the variable value or calculation result in the register. PGO uses a priority-driven register allocation method. Priorities are determined based on the execution frequency of basic blocks to ensure that frequently used variables are preferentially allocated to registers.
- Cold and hot partitioning: When PGO is not used, the compiler statically performs cold and hot partitioning based on the program structure, which is not accurate enough. The profile information is used to accurately record the call frequency of basic blocks, so that hot and cold block partitioning can be more accurate. Then basic blocks are optimized, including loop unrolling and function inlining. The profile information is also used to rearrange basic blocks. Cold blocks are placed in a remote zone, and hot blocks are gathered, which helps improve utilization of the instruction cache.
- Function rearrangement: The function definition sequence in the source code determines the function sequence in code segments, and the function sequence in code segments determines the sequence of functions loaded to the memory. As a result, cold and hot functions are mixed. The compiler obtains the function call relationship based on the profile information and rearranges the function sequence in code segments based on the call stack sequence. It strips cold functions to the end of a code segment, and arranges the hot functions based on the function call stack, thereby reducing the jump instruction overheads and improving the cache hit ratio.
- Branch rearrangement: When a conditional jump (if/else, switch/case) is mispredicted, the pipeline must be flushed, and poor branch layout also causes instruction cache misses. PGO uses instrumentation to collect the probability of each branch and reorders branches so that the common case is the fall-through path, reducing mispredictions and cache misses.
- Others: function inlining, constant propagation/folding, and loop unrolling mentioned in Optimization Levels.
LTO
Link time optimization (LTO) optimizes the program at link time. The intermediate representations of multiple files are combined into a global call graph so that the entire program can be optimized. LTO therefore analyzes the whole program and works across modules.
Add the -flto compilation option to enable LTO. Because LTO runs after all files are compiled, it solves the problem that individual .o files know nothing about one another, and the whole program can be optimized globally using the techniques described in Optimization Levels. For example, global function inlining is more thorough than inlining within a single .o file, and code that is never called anywhere in the program can be identified and removed to reduce code size (without LTO, whether code in one .o file is called from another cannot be determined).
Note that although LTO improves program performance, it lengthens compile time and increases memory usage during compilation. To mitigate this, LLVM provides the ThinLTO technology: add -flto=thin when using an LLVM-based compiler.
AutoTuner
AutoTuner automates an iterative tuning process that adjusts compilation options for a given program to achieve optimal performance. It works with the BiSheng Compiler and the AutoTuner CLI tool.
The AutoTuner optimization process consists of two phases: initial compilation and tuning.

In the initial compilation phase, the -fautotune-generate compilation option is added to the BiSheng Compiler. During the compilation, the BiSheng Compiler generates some YAML files that contain all adjustable structures, showing which structures in the target program can be tuned.
In the tuning phase, AutoTuner first reads the generated YAML files to build the corresponding search space. It then uses the specified search algorithm to choose values for a group of parameters, writes them to a compilation configuration file in YAML format, and compiles the target program. Finally, it runs the compiled binary in a user-defined way and records the resulting performance. After a number of iterations, AutoTuner selects the best configuration found and saves it as the optimal compilation configuration file in YAML format.
Currently, AutoTuner can be used in two modes with two CLI tools respectively: llvm-autotune and auto-tuner. For details, see AutoTuner Feature Guide (BiSheng Compiler).
