SVE Enabling Methods and Optimization Methods

In the BiSheng compiler, you can enable the SVE instruction set in either of the following ways:

  • Use the -mcpu option to specify a CPU model and enable all default features of that CPU.
    $ clang -mcpu=hip09
    
  • Use -march to specify an architecture version and explicitly enable the SVE extensions.
    $ clang -march=armv8+sve
    $ clang -march=armv8+sve+sve2
    

After the SVE instruction set is enabled, SVE instructions can be generated in any of the following ways:

  • Compiler automatic vectorization
  • Programming through SVE intrinsics
  • Directly writing SVE assembly code
  • Calling SVE libraries

Compiler Automatic Vectorization

BiSheng compiler enables automatic vectorization at the -O2 optimization level or higher. You can also enable it explicitly with the -fvectorize option. After the SVE instruction set is enabled, BiSheng compiler performs automatic vectorization using SVE instructions.
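
For illustration, a loop of the following shape (a hypothetical saxpy kernel; the name and signature are not from the original text) has independent, unit-stride iterations and is a typical candidate for automatic SVE vectorization at -O2 and above:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: each iteration is independent and the accesses
   are unit-stride, so the compiler can vectorize this loop automatically
   at -O2/-O3 (with SVE enabled, using SVE instructions). */
void saxpy(float a, const float *x, float *y, size_t n) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```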

Optimization diagnosis:

In real user code, many loops fail to be automatically vectorized for various reasons (for example, overly complex control flow or unsupported data types). BiSheng compiler can print diagnostic information to facilitate debugging. With this function, you can find out which loops in the code were automatically vectorized and why the remaining loops were not. In some cases, the compiler also reports how a loop was vectorized. Three options are available:

  1. -Rpass=loop-vectorize: displays all loops that are vectorized.
  2. -Rpass-missed=loop-vectorize: displays all loops that are not vectorized.
  3. -Rpass-analysis=loop-vectorize: displays all loops that are not vectorized and their causes and provides vectorization methods for some loops.

Example: test_Rpass.c

#include "math.h"

void test_rpass(float * a, int n) {
    for (int i = 0; i < n; i++) {
        a[i] = sqrt(a[i]);
    }
}

In the preceding test case, if only the -mcpu=hip09 -O3 options are used, the compiler cannot vectorize the loop automatically.

If -Rpass-missed=loop-vectorize is added, the following information is displayed:

test_Rpass.c:4:3: remark: loop not vectorized [-Rpass-missed=loop-vectorize]
    4 |   for (int i = 0; i < n; i++) {
      |   ^

The preceding information indicates that the loop in line 4 is not automatically vectorized by the compiler. If -Rpass-analysis=loop-vectorize is added, the following information is displayed:

test_Rpass.c:5:12: remark: loop not vectorized: library call cannot be vectorized. Try compiling with -fno-math-errno, -ffast-math, or similar flags [-Rpass-analysis=loop-vectorize]
    5 |     a[i] = sqrt(a[i]);
      |            ^
test_Rpass.c:4:3: remark: loop not vectorized: instruction cannot be vectorized [-Rpass-analysis=loop-vectorize]
    4 |   for (int i = 0; i < n; i++) {
      |   ^

The preceding information indicates that the loop in line 4 is not automatically vectorized because a library function is called in the loop, and it prompts the user to add options such as -fno-math-errno or -ffast-math. After the -ffast-math option is added, the compiler can automatically vectorize the loop. If the -Rpass=loop-vectorize option is then added, the following information is displayed:

test_Rpass.c:4:3: remark: vectorized loop (vectorization width: vscale x 4, interleaved count: 2) [-Rpass=loop-vectorize]
    4 |   for (int i = 0; i < n; i++) {
      |   ^

The preceding information indicates that the loop in line 4 is successfully vectorized by the compiler, and specific information about vectorization is given.

Pragmas related to automatic vectorization:

You can add pragmas to your code to assist the compiler in SVE-based automatic vectorization. The following pragmas and directives related to automatic vectorization are supported:

  1. Inform the compiler to ignore possible memory dependencies and vectorize a loop.
    C/C++:
    #pragma ivdep
    
    Fortran:
    !DIR$ IVDEP
    
  2. Inform the compiler not to consult its internal cost model: even if the compiler estimates that vectorization brings negative performance benefits, the loop is forcibly vectorized.
    C/C++:
    #pragma vector always
    
    Fortran:
    !DIR$ VECTOR ALWAYS
    
  3. Inform the compiler not to perform automatic vectorization on a loop.
    C/C++:
    #pragma clang loop vectorize(disable)
    
    Fortran:
    !DIR$ NOVECTOR
    
  4. Inform the compiler that there is no data dependency between iterations of a loop, so the corresponding restrictions do not need to be considered during automatic vectorization analysis.
    C/C++:
    #pragma clang loop vectorize(assume_safety)
    
  5. Inform the compiler to perform fixed-length (fixed) or variable-length (scalable) automatic vectorization on a loop.
    C/C++:
    #pragma clang loop vectorize(enable) vectorize_width(fixed)
    
    #pragma clang loop vectorize(enable) vectorize_width(scalable)
    
  6. Inform the compiler to unroll an unvectorized loop. The unroll factor is specified by _value_.
    C/C++:
    #pragma clang loop unroll_count(_value_)
    
    Fortran:
    !DIR$ UNROLL(_value_)
    
  7. Inform the compiler to interleave a vectorized loop. The interleave factor is specified by _value_.
    C/C++:
    #pragma clang loop interleave_count(_value_)
    
  8. Inform the compiler to optimize irregular memory accesses to a specified variable using the TBL instruction. _value_ corresponds to the target data. _num_ is an optional parameter that specifies the number of vector registers used by the TBL instruction; currently it can be 1 (default) or 2.

The correctness of the TBL instruction is guaranteed by the user: the user must ensure that the data fits in the specified number of vector registers. For details, see the Arm documentation on the TBL instruction.

C/C++:

#pragma clang loop lookup(_value_, _num_)

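As a sketch of how items 5 and 6 above are placed in source code (the function names are illustrative, not from the original text), the pragma goes immediately before the loop it applies to. The pragmas only guide the optimizer; a compiler that does not recognize them ignores them with an unknown-pragma warning, and the results are identical either way:

```c
#include <assert.h>

/* Illustrative placement of the loop pragmas from items 5 and 6.
   They affect only how the loop is compiled, not what it computes. */
void double_all(float *a, int n) {
    #pragma clang loop vectorize(enable) vectorize_width(scalable)
    for (int i = 0; i < n; i++)
        a[i] = 2.0f * a[i];
}

long long sum_unrolled(const int *a, int n) {
    long long s = 0;
    #pragma clang loop unroll_count(4)
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}
```
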
The following provides a specific example for #pragma clang loop vectorize(assume_safety):

//test_pragma.c
void update(int *restrict x, int *restrict idx, int count)
{
    #pragma clang loop vectorize(assume_safety)
    for (int i = 0; i < count; i++)
    {
        x[idx[i]]++;
    }
}

In the preceding loop, idx[i] may contain duplicate values, so vectorization could cause a memory conflict; therefore, without a pragma, the compiler does not vectorize automatically. If you can confirm that idx[i] contains no duplicate values, you can add #pragma clang loop vectorize(assume_safety) before the loop to inform the compiler that there is no data dependency between iterations, and the compiler then performs automatic vectorization. The generated assembly includes the following segment:

        ld1w    { z0.s }, p0/z, [x1, x10, lsl #2]
        add     x10, x10, x11
        cmp     x9, x10
        ld1w    { z1.s }, p0/z, [x0, z0.s, sxtw #2]
        add     z1.s, z1.s, #1                  // =0x1
        st1w    { z1.s }, p0, [x0, z0.s, sxtw #2]
        b.ne    .LBB0_7
        

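The compiler's conservatism is easy to model in plain C (W below is a hypothetical stand-in for the vector width; this is a sketch of the hazard, not the compiler's actual code generation): a W-lane gather/increment/scatter of x[idx[i]]++ loses updates whenever idx repeats a value within one vector, so the vectorized loop would compute a different result than the scalar loop.

```c
#include <assert.h>

#define W 4  /* hypothetical vector width (number of 32-bit lanes) */

/* Scalar model of a vectorized x[idx[i]]++ : gather W elements, add 1
   to each lane, scatter them back. If idx repeats a value inside one
   vector, the later scatter overwrites the earlier increment. */
void update_vector_model(int *x, const int *idx, int count) {
    int i = 0;
    for (; i + W <= count; i += W) {
        int lane[W];
        for (int j = 0; j < W; j++) lane[j] = x[idx[i + j]];  /* gather  */
        for (int j = 0; j < W; j++) lane[j] += 1;             /* add     */
        for (int j = 0; j < W; j++) x[idx[i + j]] = lane[j];  /* scatter */
    }
    for (; i < count; i++)  /* scalar remainder */
        x[idx[i]]++;
}
```

With idx = {0, 0, 1, 1}, the scalar loop yields x = {2, 2}, while the model above yields x = {1, 1}: exactly the discrepancy that assume_safety tells the compiler to ignore.
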
Option tuning:

In addition, you can tune the compiler's optimization details through compilation options to further improve the performance of SVE auto-vectorized code. The following methods can be tried:

  1. Vector length specific (VLS) programming mode
    SVE allows two programming modes: vector length agnostic (VLA) and vector length specific (VLS), which are also called variable-length programming and fixed-length programming. The difference between the two modes lies in whether the vector register length is known during programming. The advantage of variable-length programming is that once compiled, code can be executed on hardware with different vector lengths without recompilation. Fixed-length programming provides important information about vector lengths for a compiler during compilation. Therefore, the compiler can perform more optimizations and the generated code tends to perform better. By default, BiSheng compiler generates VLA code. When the vector width information is passed through -msve-vector-bits=<length>, the compiler can generate VLS code.
    $ clang -march=armv8+sve -msve-vector-bits=256 # The option indicates that 256-bit VLS code is generated.
    $ clang -mcpu=hip09 # A hardware platform is specified to indirectly instruct the compiler to generate 256-bit VLS code.
    
  2. -ffast-math option
    In fast-math mode, the compiler is allowed to optimize floating-point operations more aggressively. It may then use SVE instructions that reduce precision, and constructs that previously blocked automatic vectorization may no longer do so.
    1. This option affects floating-point precision. You are advised to use it only when your application is not sensitive to precision.
  3. Enabling the vectorized math library that supports SVE
    BiSheng compiler supports parallel mathematical function calculation and generates a vectorized interface call for the corresponding mathematical function. This function is controlled by -fveclib=<mathlib-name>. For the Kunpeng platform, BiSheng compiler integrates the libksvml vectorized math library. The libksvml vectorized math library integrated in BiSheng compiler 4.0.0 and later versions provides vectorized mathematical functions that support SVE. You can use a series of options -fveclib=KPL_SVML_SVE -fno-math-errno -lm -lksvml to enable the library. For details, see the following command:
    $ clang -O3 -mcpu=hip09 -fveclib=KPL_SVML_SVE -fno-math-errno -lm -lksvml -S test_veclib.c # The libksvml-provided vectorized mathematical function call that supports SVE is generated and linked to the corresponding library.
    
    1. The vectorized math library affects floating-point precision. You are advised to use this option only when your application is not sensitive to precision.
    2. Due to release timing, BiSheng compiler 4.0.0 does not integrate the latest libksvml vectorized math library, and correctness problems are known to occur on specific OSs. You can download the latest Kunpeng math library (KML) from the Kunpeng community and replace the original libksvml with the latest one, or add the -Wl,-z,relro,-z,now options to disable lazy binding. (This has been resolved in BiSheng 4.1.0 and later versions.)
    For details, see the following test_veclib.c test case:
    #include "math.h"
    
    void foo(double *f, int n) {
        for (int i = 0; i < n; ++i)
            f[i] = cos(f[i]);
    }
    
    By using the preceding compilation command, the assembly generated by the compiler includes the following segment (the -fno-unroll-loops option was added to disable loop unrolling when generating this assembly):
            ld1d    { z0.d }, p4/z, [x19, x23, lsl #3]
            bl      _ZGVsNxv_cos # The vectorized mathematical function that supports SVE is called.
            st1d    { z0.d }, p4, [x19, x23, lsl #3]
            add     x23, x23, x22
            cmp     x21, x23
            b.ne    .LBB0_8
    
  4. Adjusting or disabling the gather or scatter operation
    Gather and scatter are used in scenarios where data indexes are discontinuous when data is read or written. For details, see the following test_gather_scatter.c test case.
    void foo1 (int * __restrict__ y, int * __restrict__ x, int * __restrict__ idx, int size) {
        for (int i = 0; i < size; i++) {
            y[i] = x[idx[i]]; // Indexes are discontinuous when data is read.
        }
    }
    void foo2 (int * __restrict__ y, int * __restrict__ x, int * __restrict__ idx, int size) {
        for (int i = 0; i < size; i++) {
            y[idx[i]] = x[i]; // Indexes are discontinuous when data is written.
        }
    }
    
    After the SVE instruction set is enabled, BiSheng compiler supports vectorization in the gather and scatter scenarios. The generated assembly contains the following segments:
        ld1w    { z0.s }, p0/z, [x2, x10, lsl #2]
            ld1w    { z0.s }, p0/z, [x1, z0.s, sxtw #2] # Gather instructions
            st1w    { z0.s }, p0, [x0, x10, lsl #2]
            add     x10, x10, x11
            cmp     x9, x10
            b.ne    .LBB0_6
                    ld1w    { z0.s }, p0/z, [x1, x10, lsl #2]
            ld1w    { z1.s }, p0/z, [x2, x10, lsl #2]
            add     x10, x10, x11
            cmp     x9, x10
            st1w    { z0.s }, p0, [x0, z1.s, sxtw #2] # Scatter instructions
            b.ne    .LBB1_6
            

    Because gather and scatter instructions are expensive, whether such a vectorization plan is chosen depends on the cost model used during automatic vectorization. Since the compiler's cost model is a static estimate, it may be unreliable in practice: vectorization may fail in some scenarios that would benefit from gather/scatter, or be performed in scenarios that do not benefit. A typical case is a gather with a very large stride: cache misses can be severe, and gather vectorization then slows the pipeline down due to the long-tail effect.

    To address these problems, BiSheng compiler provides optimization options that let users manually adjust the cost of gather and scatter vectorization or disable gather and scatter instructions entirely, as shown in the following commands:
    $ clang -mcpu=hip09 -O3 -mllvm -sve-gather-overhead=[constant unsigned] # In the gather scenario, a larger value indicates that fewer gather instructions are generated. The default value is 5.
    $ clang -mcpu=hip09 -O3 -mllvm -sve-scatter-overhead=[constant unsigned] # In the scatter scenario, a larger value indicates that fewer scatter instructions are generated. The default value is 5.
    $ clang -mcpu=hip09 -O3 -mllvm -prefer-gather-scatter=[true|false] # You can enable or disable the gather and scatter vectorization. The default value is true.
    

    In the preceding examples, with -mcpu=hip09 -O3, BiSheng compiler vectorizes the gather and scatter scenarios. You can disable gather/scatter entirely by specifying -mllvm -prefer-gather-scatter=false, or raise their cost by specifying -mllvm -sve-gather-overhead=10 -mllvm -sve-scatter-overhead=10; in either case, the compiler no longer performs gather and scatter vectorization.

  5. BOSCC optimization
    This optimization targets loops with control flow, that is, loops whose body contains constructs such as "if-else" or "if-continue" (as in the following case). When BOSCC optimization is disabled, the compiler flattens the loop body during automatic vectorization to eliminate all branches. This vectorization scheme has a drawback: all code in the if branch of the original loop is executed regardless of whether the condition holds. If, in a real workload, most values of X[i] are 0, automatic vectorization then introduces a large amount of redundant memory traffic, and performance deteriorates.
    //test_boscc.c
    void foo1 (int * __restrict__ A, int * __restrict__ B, int * __restrict__ C, int * __restrict__ X, int size) {
        for (unsigned i = 0; i < size; i++) {
            if (X[i]) {
                A[i] = B[i] + C[i];
            }
        }
    }
    
    For the preceding scenario, the -enable-boscc-vectorization=[true|false] option can be used to enable BOSCC optimization, which adds guard conditions to the vectorized code: if all conditions (corresponding to X[i]) in one vector operation are false, the rest of that vector operation is skipped. Compared with leaving BOSCC disabled, this greatly improves performance on data sets where most elements fail the condition.
    $ clang -mcpu=hip09 -O3 -mllvm -enable-boscc-vectorization=true -S test_boscc.c # The BOSCC optimization is enabled. The default value is false.
    

    The BOSCC feature does not bring performance gains in all scenarios. BiSheng compiler provides the preceding option so that users can try BOSCC optimization as one tuning method.
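
The guarded execution that BOSCC adds can be modeled in scalar C (W is a hypothetical vector width; this sketches the idea rather than the exact generated code): each vector-sized block first tests whether any condition lane is set, and branches over the loads, add, and store entirely when none is.

```c
#include <assert.h>

#define W 4  /* hypothetical number of lanes per vector */

/* Scalar model of BOSCC: when every X[] in a block is zero, the block's
   memory traffic is skipped instead of being executed under a mask.
   For brevity this sketch assumes size is a multiple of W. */
void foo1_boscc_model(int *A, const int *B, const int *C,
                      const int *X, int size) {
    for (int i = 0; i < size; i += W) {
        int any = 0;
        for (int j = 0; j < W; j++)
            any |= (X[i + j] != 0);
        if (!any)
            continue;  /* branch over the superword block */
        for (int j = 0; j < W; j++)
            if (X[i + j])
                A[i + j] = B[i + j] + C[i + j];
    }
}
```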

  6. Tail block folding

    Tail block folding is an extension of SVE vectorization. It uses a predicate register to control which loop iterations are active, folding the remainder iterations (when the trip count is not a multiple of the vector length) into the main loop. This eliminates the separate loop for the tail block, improving both code size and performance.

    Example: tail-folding.c
    void over_epilogue (double * a, int N){
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * a[i];
    }
    

    When the Kunpeng architecture is specified using the -mcpu option and SVE vectorization is enabled, BiSheng compiler dynamically adjusts its tail block folding policy based on code features. On top of this, you can use the -mllvm --prefer-predicate-over-epilogue= option to control whether the tail block is folded and how the remainder is handled. The option has three settings:

    scalar-epilogue: does not fold the tail block; a scalar loop is created to process the tail of the loop.

    predicate-else-scalar-epilogue: attempts to fold the tail block; if folding is not possible, a scalar epilogue loop is generated instead.

    predicate-dont-vectorize: attempts to fold the tail block; if folding is not possible, the loop is not vectorized.

    For the preceding test case, when the -mcpu=hip09 -O3 option is set, BiSheng compiler does not fold the tail block. You can set the -mllvm --prefer-predicate-over-epilogue=predicate-else-scalar-epilogue option to control the compiler to fold the tail block.
    $ clang -mcpu=hip09 -O3 -mllvm --prefer-predicate-over-epilogue=predicate-else-scalar-epilogue -S tail-folding.c # The tail block folding is enabled for optimization.
    
    If the preceding option is not added, the generated assembly includes the following block, which processes the iterations left over after vectorization. With the option added, this block is not generated (only the predicated z-register operations remain).
    ...
    .LBB0_4:                                // %for.body
                                            // =>This Inner Loop Header: Depth=1
            ldr     d0, [x10]
            fadd    d0, d0, d0
            str     d0, [x10], #8
            subs    x8, x8, #1
            b.ne    .LBB0_4
    ...
    
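
The predication that tail folding relies on can be modeled in scalar C (W is a hypothetical vector width; this is a sketch of the concept, not the generated code): every block, including the final partial one, runs through the same loop body, and a per-lane predicate disables the lanes that fall outside the array, so no separate scalar epilogue is needed.

```c
#include <assert.h>

#define W 4  /* hypothetical number of doubles per vector */

/* Scalar model of a tail-folded loop: the condition (i + j < N) plays
   the role of the predicate register, masking off out-of-range lanes
   in the last block instead of running a scalar remainder loop. */
void over_epilogue_model(double *a, int N) {
    for (int i = 0; i < N; i += W)
        for (int j = 0; j < W; j++)
            if (i + j < N)  /* predicate lane */
                a[i + j] = 2.0 * a[i + j];
}
```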

SVE Intrinsics Programming and Directly Writing SVE Assembly Code

BiSheng compiler supports all SVE intrinsic interfaces defined in the Arm C Language Extensions (ACLE) for SVE. You can call the corresponding intrinsic directly from high-level languages such as C and C++ to generate the corresponding instructions. For the interface list and behaviors, see ACLE for SVE.

For SVE intrinsics programming, include the header file arm_sve.h. It provides the SVE intrinsic interfaces supported by BiSheng compiler, as well as the definitions of the SVE vector types (corresponding to the z registers) and SVE predicate types (corresponding to the p registers). For details, see ACLE for SVE. The following test_intrinsic.c test case is an example of SVE intrinsics programming:

#include <arm_sve.h>
double test_sve_intrinsic(svbool_t pg, svfloat64_t op) {
    double result = svaddv(pg, op);
    return result;
}

In the preceding example, svfloat64_t is a vector type representing a vector of 64-bit floating-point elements (corresponding to zn.d), svbool_t is a predicate type (corresponding to pn), and svaddv is the SVE intrinsic for horizontal (accumulative) addition. You can obtain the corresponding assembly as follows:

$ clang -mcpu=hip09 -O3 test_intrinsic.c -S
...
faddv   d0, p0, z0.d
...

The faddv instruction adds the elements of the z0 register whose corresponding predicate bits in p0 are set and stores the result in the d0 register. For more examples of SVE intrinsics programming, see the official Arm document SVE-SVE2-programming-examples.
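
The semantics of faddv (and thus of the svaddv intrinsic) can be described with a short scalar model (W is a hypothetical lane count; this is an illustration, not the hardware implementation): sum exactly those lanes whose predicate bit is set.

```c
#include <assert.h>

#define W 4  /* hypothetical number of 64-bit lanes in a vector */

/* Scalar model of svaddv/faddv: reduce the active lanes of a vector
   to a single scalar, skipping lanes whose predicate bit is clear. */
double svaddv_model(const int pg[], const double op[]) {
    double s = 0.0;
    for (int j = 0; j < W; j++)
        if (pg[j])
            s += op[j];
    return s;
}
```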

If you have a deep understanding of the SVE instruction set, you can write functions directly in SVE assembly. Note that such functions must conform to the ABI requirements of the Procedure Call Standard for the Arm Architecture (AAPCS) for function calls. Given SVE assembly source and a specified instruction set, BiSheng compiler can produce the corresponding object file and executable. The following is an example, test_assembly_code.s:

        .globl  test_sve_intrinsic              // -- Begin function test_sve_intrinsic
        .p2align        4
        .type   test_sve_intrinsic,@function
        .variant_pcs    test_sve_intrinsic
test_sve_intrinsic:                     // @test_sve_intrinsic
        .cfi_startproc
// %bb.0:
        faddv   d0, p0, z0.d
                                        // kill: def $d0 killed $d0 killed $z0
        ret
.Lfunc_end0:
        .size   test_sve_intrinsic, .Lfunc_end0-test_sve_intrinsic
        .cfi_endproc

You can assemble it into an object file as follows:

$ clang -mcpu=hip09 -O3 -c test_assembly_code.s