Customized Optimization Options

BiSheng compiler supports customized optimization options that are passed through -mllvm. Because these optimizations target the Kunpeng architecture, a Kunpeng CPU must be specified for them to take effect, for example, -mcpu=tsv110.

-mllvm -force-customized-pipeline=<true|false>

This option forcibly uses the customized pass pipeline. The value true indicates that the optimization is enabled. By default, the optimization is disabled.

-mllvm -sad-pattern-recognition=<true|false>

This option recognizes the sum of absolute differences (SAD) pattern, sum += abs(a[i] - b[i]), and generates a simpler and more efficient instruction sequence for it. The value true indicates that the optimization is enabled. By default, the optimization is enabled.
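As a minimal sketch, the loop below has the shape described above. The function name sad_u8 and the 8-bit element type are assumptions chosen for illustration; the source does not state which element types the recognizer accepts.

```c
#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences over two byte arrays (hypothetical helper). */
uint32_t sad_u8(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* Both operands are promoted to int, so the subtraction cannot wrap. */
        sum += (uint32_t)abs(a[i] - b[i]);
    }
    return sum;
}
```

To compare the generated code with and without the optimization, the same file can be compiled once with -mllvm -sad-pattern-recognition=true and once with the value false.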

-mllvm -instcombine-ctz-array=<true|false>

This option optimizes the calculation for De Bruijn sequence table lookup. The value true indicates that the optimization is enabled. By default, the optimization is enabled.
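For context, the widely used De Bruijn table lookup for counting trailing zeros looks roughly like the sketch below. The names debruijn_ctz32 and ctz32_debruijn are illustrative, and whether the compiler matches this exact form is an assumption.

```c
#include <stdint.h>

/* Lookup table for the 32-bit De Bruijn constant 0x077CB531. */
static const uint8_t debruijn_ctz32[32] = {
    0,  1,  28, 2,  29, 14, 24, 3,  30, 22, 20, 15, 25, 17, 4,  8,
    31, 27, 13, 23, 21, 19, 16, 7,  26, 12, 18, 6,  11, 5,  10, 9
};

/* Count trailing zeros of a nonzero 32-bit value. */
unsigned ctz32_debruijn(uint32_t v)
{
    /* Isolate the lowest set bit; 0u - v is the two's complement of v. */
    uint32_t lowest = v & (0u - v);
    /* The top five bits of the product select a unique table entry. */
    return debruijn_ctz32[(uint32_t)(lowest * 0x077CB531u) >> 27];
}
```

A compiler that recognizes the whole sequence can replace the multiplication and the table load with a native count-trailing-zeros sequence.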

-mllvm -aarch64-loopcond-opt=<true|false>

This option removes unnecessary instructions from loop condition checks in some cases to optimize the code. The value true indicates that the optimization is enabled. By default, the optimization is enabled.

-mllvm -aarch64-hadd-generation=<true|false>

This option uses a single Arm NEON instruction, URHADD, to perform the vectorized operation (x[i] + y[i] + 1) >> 1. The value true indicates that the optimization is enabled. By default, the optimization is enabled.
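The following sketch shows a scalar loop that computes this rounding average. The function name and the unsigned 8-bit element type are assumptions; the source only gives the expression itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Rounding average of two byte arrays: out[i] = (x[i] + y[i] + 1) >> 1. */
void rounding_average_u8(uint8_t *out, const uint8_t *x,
                         const uint8_t *y, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* The sum is computed in int, so 255 + 255 + 1 does not overflow. */
        out[i] = (uint8_t)((x[i] + y[i] + 1) >> 1);
    }
}
```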

-mllvm -enable-mem-chk-simplification=<true|false>

This option simplifies the logic of runtime checks generated for LLVM loop vectorization and improves loop vectorization code. The value true indicates that the optimization is enabled. By default, the optimization is enabled.

-mllvm -aarch64-ldp-stp-noq=<true|false>

This option prohibits the generation of stp/ldp instructions of the form stp/ldp q1, q2, addr, because the performance of these instructions is not ideal. The value true indicates that the optimization is enabled. By default, the optimization is enabled.

-mllvm -enable-func-arg-analysis=<true|false>

This option enhances LLVM range analysis to adapt LLVM function specialization optimization to more functions. The value true indicates that the optimization is enabled. By default, the optimization is enabled.

-mllvm -enable-modest-vectorization-unrolling-factors=<true|false>

This option simplifies vectorization for loops with a smaller step. The value true indicates that the optimization is enabled. By default, the optimization is enabled.

-mllvm -instcombine-shrink-vector-element=<true|false>

This option improves the degree of parallelism (DOP) of vectorized instructions and eliminates scalar intermediate values generated during vectorization, improving the effect of loop vectorization. The value true indicates that the optimization is enabled. By default, the optimization is enabled.

-mllvm -instcombine-reorder-sum-of-reduce-add=<true|false>

This option changes the sequence of reduction operations to improve the reduction code. The value true indicates that the optimization is enabled. By default, the optimization is enabled.

-mllvm -replace-fortran-mem-alloc=<true|false>

This option allocates stack memory, instead of heap memory, to improve performance when a memory allocation operation of known size (such as arrays) is required in Fortran code. The value true indicates that the optimization is enabled. By default, the optimization is enabled.

The memory size for the optimization is specified by -mllvm -max-fortran-heap-to-stack-size=<Int number>. The default value is 4096.

-mllvm -enable-pg-math-call-simplification=<true|false>

This option simplifies calls to multiple Fortran math library functions to improve call performance. The value true indicates that the optimization is enabled. By default, the optimization is enabled.

-mllvm -instcombine-gep-common=<true|false>

This option optimizes the element address calculation for multi-dimensional arrays in complex scenarios (such as nested loops) to reduce the register pressure and improve program performance. The value true indicates that the optimization is enabled. By default, the optimization is enabled.

-mllvm -enable-sroa-after-unroll=<true|false>

This option runs scalar replacement of aggregates (SROA) after loop unrolling to reduce memory access operations and keep variables in registers. The value true indicates that the optimization is enabled. By default, the optimization is enabled.

-mllvm -disable-recursive-bonus=<true|false>

By default, calls inside a recursive function receive an inlining bonus that makes them easier to inline, which improves the performance of frequently called recursive functions. The value true disables this bonus. The default value is false, indicating that the bonus is enabled.

-mllvm -disable-recip-sqrt-opt=<true|false>

This option optimizes the formats of A = (C / sqrt(Y)) and B = A * A in FastMath scenarios to reduce the number of instructions. The value true indicates that the optimization is disabled. The default value is false, indicating that the optimization is enabled.

-mllvm -disable-loop-aware-reassociation=<true|false>

This option adds loop awareness to Reassociate Pass to limit some operations within the loop, preventing performance deterioration caused by the increase of instructions in the loop. The value true indicates that the optimization is disabled. The default value is false, indicating that the optimization is enabled.

-mllvm -enable-gzipcrc32=<true|false>

This option identifies the CRC32 calculation logic in the code and uses the built-in instructions of the processor to accelerate the calculation. If this option is set to true, the optimization is enabled. If this option is set to false, the optimization is disabled. By default, the optimization is disabled.

If this option is enabled:

Ensure that the target is AArch64.
Ensure that the optimization level is O3.
Add the link options -lclang_rt.irlib -L <install_path>/lib/clang/15.0.4/lib/aarch64-unknown-linux-gnu.

If the <install_path>/lib/clang/15.0.4/lib/aarch64-unknown-linux-gnu/clang_rt.irlib file does not exist, do not enable this option.
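For reference, the sketch below computes the CRC-32 used by gzip (reflected polynomial 0xEDB88320) bit by bit. It only illustrates the kind of calculation logic meant here; the function name is hypothetical, and the source does not say whether this bitwise form or only a table-driven form is recognized.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 as used by gzip (reflected polynomial 0xEDB88320). */
uint32_t crc32_bitwise(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            /* mask is all ones when the low bit is set, otherwise zero. */
            uint32_t mask = 0u - (crc & 1u);
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    return ~crc;
}
```

The standard check value for this CRC is 0xCBF43926 for the nine ASCII bytes 123456789, which is a convenient way to confirm that an optimized build still produces the same result.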

-mllvm -enable-aes=<true|false>

This option identifies the AES calculation logic in the code and uses the built-in instructions of the processor to accelerate the calculation. If this option is set to true, the optimization is enabled. If this option is set to false, the optimization is disabled. By default, the optimization is disabled.

If this option is enabled:

Ensure that the target is AArch64.
Ensure that the optimization level is O3.
Add the link options -lclang_rt.irlib -L <install_path>/lib/clang/15.0.4/lib/aarch64-unknown-linux-gnu.

If the <install_path>/lib/clang/15.0.4/lib/aarch64-unknown-linux-gnu/clang_rt.irlib file does not exist, do not enable this option.

-mllvm -enable-mayalias-loadpromotion=<true|false>

At LTO and O3, unnecessary load instructions in the loop branch are identified and deleted. If this option is set to true, the optimization is enabled. If this option is set to false, the optimization is disabled. By default, the optimization is disabled.

-mllvm -enable-merge-reversed-icmps=<true|false>

At O3, reversed consecutive integer comparisons are merged (the target must be in little-endian order). If this option is set to true, the optimization is enabled. If this option is set to false, the optimization is disabled. By default, the optimization is disabled.

-fno-inline-reduction-intrinsic

At O1, O2, and O3, BiSheng compiler inlines minloc and maxloc in the flang1 phase. After inlining, the calls become simple for loops, which facilitates further optimization in LLVM. This option disables the inlining, which gives the same behavior as O0.

-mllvm -update-iv-scev

This option updates the SCEV analysis result in the induction variable users pass to expose more optimization opportunities. This option is enabled by default, which may increase the compilation time. If compilation time is critical, you can set -mllvm -update-iv-scev to false.

gep-common

This option generates a common parent for GEP clusters that originate from the same instruction by removing add instructions (that are used as indexes).

-mllvm -gep-common=<true|false> indicates whether to enable the optimization. If the value is set to true, the optimization is enabled. By default, the optimization is enabled.
-mllvm -gep-cluster-min=<Int number> indicates the GEP cluster threshold. The default value is 3.
-mllvm -gep-loop-mindepth=<Int number> indicates the loop threshold. The default value is 3.

array-restructuring

This option optimizes the memory access mode of one or more arrays in a program and rearranges arrays to reduce the running time.

-mllvm -enable-array-restructuring=<true|false> indicates whether to enable the optimization. If the value is set to true, the optimization is enabled. By default, the optimization is enabled.
-mllvm -skip-array-restructuring-codegen=<true|false> indicates whether to disable the code generation part of the optimization pass. If the value is set to true, the code generation part of the optimization pass is disabled. The default value is false.

struct-peel

This option performs structure peeling, which improves cache locality when the fields of the structures in a structure array are accessed, reducing the running time.

-mllvm -enable-struct-peel=<true|false> indicates whether to enable the optimization. If the value is set to true, the optimization is enabled. By default, the optimization is enabled.
-mllvm -struct-peel-skip-transform=<true|false> indicates whether to disable the code generation part of the optimization pass. If the value is set to true, the code generation part of the optimization pass is disabled. The default value is false.
-mllvm -struct-peel-this=... indicates forcibly peeling a structure defined by the user (subject to legality).
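As a rough illustration of where structure peeling can help, the sketch below sums one frequently accessed field of a structure array while the other fields go unused in the loop. The structure layout and all names are hypothetical, and whether this particular structure passes the legality checks is an assumption.

```c
#include <stddef.h>

/* Hypothetical record: one hot field and two cold fields. */
struct particle {
    double mass;      /* read on every iteration */
    char   name[48];  /* never touched by the loop */
    int    id;        /* never touched by the loop */
};

/* Only the mass field is accessed, yet each element occupies 64 bytes,
   so most of every cache line that is loaded goes unused. */
double total_mass(const struct particle *p, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += p[i].mass;
    }
    return sum;
}
```

Peeling the hot field into its own array would let the loop read densely packed values.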

-fopenmp-reduction-duplicate

This option enhances the vectorization capability in the OpenMP reduction scenario. It takes effect only for the AArch64 backend and the C and C++ frontends, and only when -fopenmp is enabled. This option is enabled by default. You can use -fno-openmp-reduction-duplicate to disable it.
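A typical OpenMP reduction loop is sketched below. The function name dot is illustrative; the pragma is standard OpenMP and is ignored when -fopenmp is not specified, in which case the loop runs serially.

```c
/* Dot product with an OpenMP sum reduction. */
double dot(const double *x, const double *y, int n)
{
    double sum = 0.0;
    /* Each thread accumulates a private copy of sum, combined at the end. */
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < n; i++) {
        sum += x[i] * y[i];
    }
    return sum;
}
```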

-fopenmp-firstprivatize-locals

This option enhances the vectorization capability in the OpenMP firstprivatize scenario. It takes effect only for the AArch64 backend and the C and C++ frontends, and only when -fopenmp is enabled. This option is enabled by default. You can use -fno-openmp-firstprivatize-locals to disable it.

-mllvm -sort-ivusers-before-lsr

When this option is used, loop strength reduction (LSR) optimization is performed only after induction variable users are sorted. This prevents the generated assembly from differing between repeated compilations of the same code.

To save compilation time, this option is disabled by default.

-finline-fortran-runtime-calls

BiSheng compiler can inline string comparison in the flang2 phase. After inlining, a function call becomes a simple for-loop character comparison, which can be further optimized in LLVM. This inlining is disabled by default. Use this option to enable it.

-foverflow-shift-alt-behavior

For undefined shift behavior that exceeds the bit width of the integer data type, for example, (int) a << 40, BiSheng compiler optimizes the expression to an integer constant in advance to prevent the expression from being identified and optimized as different values in different optimizations. This option is disabled by default.

-mllvm -warn-large-symbols=<num>

In C, C++, and Fortran programs that contain ultra-large symbols, the relocation type used for symbol addressing may not support such a large offset range. This option identifies large symbols that may exist in the source code and reports a warning for each of them. The threshold for a large symbol is set by the option parameter num, in MB.

-mllvm -aggressive-instcombine-simplify-mul64=<true|false>

This option simplifies the code sequence that multiplies two 64-bit operands to produce a 128-bit result, replacing it with more efficient instructions. The value true indicates that the optimization is enabled. By default, the optimization is enabled. Currently, the C and C++ languages and the AArch64 backend are supported.
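A portable way to obtain the 128-bit product is to split each operand into 32-bit halves, as in the sketch below. The function name mul64_wide is hypothetical, and whether the compiler matches this exact formulation is an assumption.

```c
#include <stdint.h>

/* Multiply two 64-bit values; return the low 64 bits and store the high
   64 bits of the 128-bit product in *hi. */
uint64_t mul64_wide(uint64_t a, uint64_t b, uint64_t *hi)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t p0 = a_lo * b_lo;  /* contributes to bits 0..63   */
    uint64_t p1 = a_lo * b_hi;  /* contributes to bits 32..95  */
    uint64_t p2 = a_hi * b_lo;  /* contributes to bits 32..95  */
    uint64_t p3 = a_hi * b_hi;  /* contributes to bits 64..127 */

    /* Sum of three values below 2^32 each, so it cannot overflow. */
    uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;

    *hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
    return (mid << 32) | (uint32_t)p0;
}
```

On AArch64 the same result can be produced with one MUL for the low half and one UMULH for the high half, which is presumably the more efficient form referred to above.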

-mllvm -replace-sqrt-compare-by-square=<true|false>

This option rewrites a comparison condition so that the square root calculation in it is replaced with a square calculation. This optimization takes effect only when -ffast-math is enabled. The value true indicates that the optimization is enabled. By default, the optimization is disabled. Currently, the C and C++ languages and the AArch64 backend are supported.

-mllvm -enable-combine-sqrt-exp=<true|false>

This option optimizes sqrt(exp(x)) to exp(x*0.5) to eliminate sqrt calculations with high execution costs. This optimization is enabled only when -ffast-math is enabled. The value true indicates that the optimization is enabled. By default, the optimization is enabled. Currently, the C and C++ languages and the AArch64 backend are supported.

-mllvm -loop-load-widen-patterns=<id1>,<id2>,...

This option optimizes scenarios where some data can be accessed using a wider data type. Currently, three scenarios are supported, and the IDs are 0, 1, and 2, respectively. If the default value is used, all of them are enabled. If an ID less than 0 (such as -1) is specified, the optimization is disabled. Currently, the C language and the AArch64 backend are supported.

-mllvm -enable-aggressive-inline=<true|false>

This option ignores the __attribute__((noinline)) restriction in the source code and treats the function as an ordinary function when deciding whether to inline it. The value true indicates that the optimization is enabled. By default, the optimization is disabled. Currently, the C and C++ languages and the AArch64 backend are supported.
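The sketch below shows a function marked noinline that is called in a hot loop. The function names are illustrative. With the option left at its default, the attribute is honored and the call stays out of line.

```c
/* The attribute normally prevents this function from being inlined. */
__attribute__((noinline)) int square(int x)
{
    return x * x;
}

/* Calls square() on every iteration, so inlining would remove call overhead. */
int sum_of_squares(int n)
{
    int total = 0;
    for (int i = 1; i <= n; i++) {
        total += square(i);
    }
    return total;
}
```

Because the option overrides an explicit source annotation, it is worth checking that the attribute was not added for correctness reasons, such as keeping a stable symbol for debugging or profiling.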

shift rounding

This option matches a rounding scenario and selects an appropriate instruction to reduce the running time.

-mllvm -aarch64-optimize-rounding=<true|false> controls the optimization. The value true indicates that the optimization is enabled. By default, the optimization is enabled.
-mllvm -aarch64-optimize-rounding-saturation=<true|false> determines whether to optimize the SQRSHRUN/UQRSHRN scenario. The value true indicates that the optimization is enabled. By default, the optimization is enabled.
-mllvm -aarch64-extract-vector-element-trunc-combine=<true|false> provides better instruction selection in some scenarios. The value true indicates that the optimization is enabled. By default, the optimization is enabled.
-mllvm -aarch64-rounding-search-max-depth=<integer> sets the search depth of a rounding scenario. The default value is 4.

-mllvm -aggressive-instcombine-simplify-sqr64=<true|false>

This option simplifies the instruction sequence for the 64-bit SQR operation to improve performance. The value true indicates that the optimization is enabled. By default, the optimization is enabled. Currently, the C and C++ languages and the AArch64 backend are supported.

-mllvm -enable-value-spec=<true|false>

For the res = x ? y : z ternary operator expression, when y or z contains zero and is selected as the result, calculations such as multiplication that references the res variable can be optimized to improve performance. This optimization identifies a select instruction formed by the ternary operator in an IR and eliminates some redundant instructions that use the select instruction result. Note that this optimization changes the select instruction only when the ternary operator forms the select instruction in the middle-end of the compiler.

-mllvm -vspec-min-select-users=<integer> controls the number of redundant instructions that use the select instruction result. When the actual number is greater than the specified value, the optimization is enabled. The default value is 5.
-mllvm -vspec-search-users-depth=<integer> controls the depth of a basic block that contains redundant instructions and is searched downwards from the basic block where the select instruction result is located. The default value is 5.
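A minimal sketch of the pattern: one arm of the ternary operator is zero, and the result feeds several multiplications. The function name and the use of six multiplications (more than the default threshold of 5) are assumptions for illustration; other passes may transform this code before the optimization sees it.

```c
/* res is zero whenever x is zero, so every product below is then zero too. */
int scaled_sum(int x, int y, const int a[6])
{
    int res = x ? y : 0;
    return res * a[0] + res * a[1] + res * a[2]
         + res * a[3] + res * a[4] + res * a[5];
}
```

When the zero arm is selected, all six multiplications are redundant, which is the kind of work the optimization aims to eliminate.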

-mllvm -relaxed-ordering-level=<RO_DISABLE|RO_1|RO_2|RO_3>

RO_DISABLE indicates that weak memory ordering protection is disabled, and RO_1, RO_2, or RO_3 indicates that weak memory ordering protection is enabled. RO_1 uses the basic repair policy to ensure normal application execution and has the greatest performance loss. RO_2 uses the most secure repair policy and has great performance loss. RO_3 applies component optimization rules to reduce performance loss.

-mllvm -enable-highlevel-branch-prediction=<true|false>

This option predicts the execution direction of a branch based on the information in the code. If this option is set to true, the optimization is enabled. If this option is set to false, the optimization is disabled. By default, the optimization is disabled.

-mllvm -enable-boscc-vectorization

This option reintroduces the control flow of the scalar loop into the vectorized code. If this option is set to true, the optimization is enabled. If this option is set to false, the optimization is disabled. By default, the optimization is disabled.

-mllvm -ahead-prefetch=<true|false>

This option prefetches the data of the entire calculation array in advance for a simple loop calculation with OpenMP enabled. It takes effect only for the AArch64 backend. If this option is set to true, the optimization is enabled. If this option is set to false, the optimization is disabled. By default, the optimization is enabled.

-mllvm -antisca-spec-mitigations=<true|false>

This option adds protection instructions to all instructions that may access memory, to prevent Spectre variant 1 attacks.

-mllvm -enable-spectre-detect=<true|false> -mllvm -aarch64-slh-vuln-loads-only=<true|false>

The first option detects 15 Spectre variant 1 attack scenarios described in papers and reports the file names and line numbers in warning messages. The second option adds protection instructions to the detected scenarios.
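For readers unfamiliar with Spectre variant 1 (bounds check bypass), the sketch below shows the classic code shape from the public literature. The array names and sizes are illustrative, and whether this exact form is one of the 15 detected scenarios is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

uint8_t array1[16] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16};
uint8_t array2[256 * 64];
size_t  array1_size = 16;

/* The bounds check is architecturally correct, but the processor may
   speculatively execute the dependent loads before the check resolves,
   leaving a cache footprint that depends on out-of-bounds data. */
uint8_t victim(size_t idx)
{
    if (idx < array1_size) {
        return array2[array1[idx] * 64];
    }
    return 0;
}
```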

-mllvm -enable-new-loop-distribute=<true|false>

This option enables more aggressive loop splitting optimization to facilitate loop vectorization.

-mllvm -aarch64-enable-early-sve-libcall-opts=<true|false>

This option replaces std::find() with a better-performing version that is implemented with SVE instructions inside BiSheng compiler.

-mllvm -enable-sm4 -march=armv8-a+sm4

These options enable SM4 block cipher algorithm instructions at the backend.

-mllvm -enable-malloc-distribute=<true|false>

This option splits loops that perform malloc operations on multiple arrays and may become performance bottlenecks, improving memory access performance.

-mllvm -loop-iteration-prefetch-funcs=<func>:<iter>

When this option is used, a centralized prefetch is performed on data to be accessed in the next <iter> iterations of all loops in the <func> function, once every <iter> iterations. This is useful for optimizing scenarios where, within a single iteration of the loop, there are multiple memory access operations, and the addresses being accessed are very scattered.

-mllvm -licm-enable-sink-store-cfg=true

For certain loops that contain control flow and have a single exit block, this option moves store operations out of the loop. It supports one-dimensional arrays declared with the __restrict__ keyword and ordinary pointer variables. This optimization reduces memory read and write operations.
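The sketch below shows a loop with a conditional store through a __restrict__ pointer. The function name is hypothetical, and whether this exact loop qualifies is an assumption.

```c
/* Conditionally accumulates into *out inside the loop. Because out does not
   alias in, the running value could be kept in a register and stored once
   after the loop. */
void sum_positive(int *__restrict__ out, const int *__restrict__ in, int n)
{
    for (int i = 0; i < n; i++) {
        if (in[i] > 0) {
            *out += in[i];
        }
    }
}
```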

-mllvm -simplifycfg-hoist-abs

In scenarios where control flow blocks vectorization, this option uses the select instruction to simplify branch jumps, thereby assisting in vectorization.

-mllvm -combine-complex-gather-load

In scenarios where gather load accesses complex types, this option improves memory access efficiency by combining load instructions.

-mllvm -ascend-load-address=[true|false]

When this option is enabled, load instructions that can be scheduled are arranged during instruction scheduling in ascending order of their offset immediate values.

-m[no-]fcmla

This option generates the more efficient FCMLA instruction for complex-number-related computations (complex multiplication, subtraction, and conjugate multiplication) to improve performance. -mfcmla enables the function, and -mno-fcmla disables it.
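As an illustration of the complex multiplication case, the sketch below multiplies two arrays of single-precision complex numbers element by element using the C99 complex type. The function name cmul is hypothetical, and the source does not specify which source forms or additional options are required for FCMLA to be selected.

```c
#include <complex.h>
#include <stddef.h>

/* Element-wise complex multiplication: out[i] = a[i] * b[i]. */
void cmul(float complex *out, const float complex *a,
          const float complex *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = a[i] * b[i];
    }
}
```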

-fopenmp-htl

BiSheng compiler supports the use of lightweight threads in the Hyper Thread Library (HTL) to replace system threads in OpenMP scenarios. You do not need to modify the code. The compiler runtime supports this function. The HTL is not directly integrated in BiSheng. For details about how to download the HTL and more information, visit the Kunpeng community.

To enable the HTL, perform the following steps:

Compilation: Use -L to specify the path of the HTL libhtl.so.
Running: Use LD_LIBRARY_PATH to specify the path of libhtl.so.
Control option: -fopenmp-htl (If both -fopenmp-htl and -fopenmp are used, -fopenmp-htl has a higher priority.)
Environment variable: KMP_HTL_NUM_ESS sets the number of execution flows running on OS-level threads (similar to specifying the number of Pthreads).
