Tuning Strategy

Tune the compiler precision in the following order:

  • Set the compilation option on both the Kunpeng and x86 platforms to -O1 and use the same open source SLEEF math library. Then check whether the Kunpeng precision matches the x86 precision, or whether the precision improves compared with -O3. Because ICC has a loop-combining bug, you may need to use the -O1 option for specific files.
    Figure 1 ICC loop combining bug

    By the code semantics, this loop computes SUMDP by adding the numbers in array-index order. Testing shows that ICC produces different SUMDP results with -O3 than with -O0 or -O1.

    This example compares the binary results of sequentially adding six numbers (the correct result, calculated by hand, is 0b01000110010110001100000000110100):

    • With ICC and -fp-model precise -O3, the result is 0b01000110010110001100000000110011.
    • With ICC and -fp-model precise -O0 or -fp-model precise -O1, the result is 0b01000110010110001100000000110100.
    • With BiSheng and -O3 or -O0, the result is 0b01000110010110001100000000110100.

    Therefore, precision is lost when ICC optimizes the loop with the -fp-model precise -O3 compilation configuration. This optimization conflicts with -fp-model precise, so it is an ICC bug.
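    The loop from Figure 1 is not reproduced here, so the following is only a minimal sketch of the general effect: single-precision addition is not associative, so regrouping a sum can change the last bit. The six data values and the two-lane regrouping are hypothetical, chosen for illustration. They are not the WRF data and are not claimed to be the transformation ICC applies.

```python
import struct


def f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def bits32(x):
    """Return the IEEE 754 single-precision bit pattern of x as 32 binary digits."""
    return format(struct.unpack("<I", struct.pack("<f", x))[0], "032b")


def sum_sequential(values):
    """Add in array-index order, rounding to single precision after every step."""
    total = 0.0
    for v in values:
        total = f32(total + v)
    return total


def sum_two_lanes(values):
    """Hypothetical regrouping: accumulate even and odd indices separately, then combine."""
    lane0 = 0.0
    lane1 = 0.0
    for i, v in enumerate(values):
        if i % 2 == 0:
            lane0 = f32(lane0 + v)
        else:
            lane1 = f32(lane1 + v)
    return f32(lane0 + lane1)


# Hypothetical data: one large term followed by five terms below half an ulp of 1.0.
tiny = 2.0 ** -25
data = [1.0, tiny, tiny, tiny, tiny, tiny]

seq = sum_sequential(data)
reordered = sum_two_lanes(data)

print("sequential:", "0b" + bits32(seq))
print("regrouped: ", "0b" + bits32(reordered))
```

    In this sketch the two sums differ only in the last bit (bit patterns 0x3F800000 and 0x3F800001), which is the same kind of one-bit discrepancy as in the ICC results above.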

    In the WRF source code, the following files are affected by the bug:

    • physics/module_cu_bmj.F (corresponding physical parameter: cu_physics=2)
    • physics/module_cu_gf_deep.F (corresponding physical parameter: cu_physics=3)
    • physics/module_ra_goddard.F (corresponding physical parameter: ra_lw_physics=5/ra_sw_physics=5)
  • After ruling out the differences at -O1, select compilation options from Table 1 and tune against the preset precision benchmark. Of the modes in Table 1, the optimal precision mode is the most commonly used.
    Table 1 Precision tuning policy

    | Precision Mode | x86 Option | BiSheng Option | Best Result |
    | --- | --- | --- | --- |
    | Optimal performance mode | -O3 | -O3 -ffp-model=fast | Approximate |
    | Optimal precision mode | -O3 -fp-model=precise -no-ftz -init=zero -init=arrays | -O3 -faarch64-pow-alt-precision=21 -mllvm -enable-alt-precision-math-functions -lkm_l9 -Hx,124,0xc00000 -ffp-contract=off -finit-zero -mllvm -disable-sincos-opt -MflushZ, or -O3 -ffp-compatibility=18/21 -finit-zero -lkm_l9 | Consistent |
    | Precision and performance compromise mode | Remove some of the precision options: -no-ftz, -init=zero, and -init=arrays. | Remove some of the precision options: -ffp-contract=off, -finit-zero, and -disable-sincos-opt. | Consistent |
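    The optimal precision mode in Table 1 passes -ffp-contract=off to BiSheng, which stops the compiler from fusing a multiply and an add into one FMA instruction. The sketch below suggests why that matters: a fused operation rounds once, while separate operations round twice. The fused result is emulated with exact rational arithmetic rather than a hardware FMA, and the input values are hypothetical, chosen so that the two results differ.

```python
from fractions import Fraction


def fused_multiply_add(a, b, c):
    """Emulate an FMA: compute a*b+c exactly, then round once to double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def separate_multiply_add(a, b, c):
    """Unfused form: the product is rounded to double before the addition."""
    return a * b + c


# Hypothetical inputs: the exact product is 1 - 2**-60, which rounds to 1.0 in double.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

separate = separate_multiply_add(a, b, c)
fused = fused_multiply_add(a, b, c)

print("separate multiply and add:", separate)
print("fused multiply-add:       ", fused)
```

    Here the unfused form returns exactly zero, while the fused form keeps the small residual of -2**-60. Code that takes one path on x86 and the other on Kunpeng can therefore diverge, even though each result is correctly rounded for the operation that was actually performed.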

    Table 2 Main compilation options affecting precision

    Category

    Precision Compilation Option

    x86 Option

    BiSheng Option

    BiSheng Tuning Option

    Impact on Precision

    Compiler

    Compiler version

    2018/2021

    2.1-3.1

    2.1-3.1

    Different compiler versions may lead to inconsistent results. In fast mode, different Intel compiler versions process floating-point numbers with different precision, which causes key indicators to fluctuate and prevents the Kunpeng results from being fitted to them. Therefore, use the compiler version specified by the customer. For BiSheng, a later version is preferred.

    Optimization option

    Optimization option

    O0-O3-Ofast

    O0-O2

    O3-Ofast

    The O0 option disables all optimizations, which greatly degrades performance, so avoid it. Optimize the precision starting from the O3 option. If the ICC loop-combining bug occurs, use the -O1 option only for the affected files.

    Floating-point model

    Calculation reordering

    None

    Disabled by default.

    funsafe-math-optimizations

    The following reordering compilation options can be enabled:

    -fno-signed-zeros

    -fno-trapping-math

    -fassociative-math

    -freciprocal-math

    Fast calculation

    ffast-math

    Disabled by default.

    ffast-math

    The following options are enabled by default:

    -fno-honor-infinities

    -fno-honor-nans

    -fno-math-errno

    -ffinite-math-only

    -fassociative-math

    -freciprocal-math

    -fno-signed-zeros

    -fno-trapping-math

    -ffp-contract=fast

    Unified precision option

    None

    ffp-compatibility

    Disabled by default.

    Set -ffp-compatibility=18/21 to enable this option. If the related precision options are not set explicitly, the following options are enabled.

    To fit Intel 2021: setting -ffp-compatibility=21 -fp-model=precise is equivalent to setting -ffp-contract=off -faarch64-pow-alt-precision=21 -Hx,124,0xc00000 -mllvm -enable-alt-precision-math-functions.

    To fit Intel 2018: setting -ffp-compatibility=18 -fp-model=precise is equivalent to setting -ffp-contract=off -faarch64-pow-alt-precision=21 -Hx,124,0x800000 -mllvm -enable-alt-precision-math-functions -enable-18-math-compatibility.

    The following precision options are not included in -ffp-compatibility, for the reasons given:

    • -finit-zero: Initializes uninitialized Fortran variables to zero. The Intel compiler controls this behavior with a separate option, and some applications cannot run properly after that option is added on the Intel side. In that case, BiSheng cannot enable the corresponding option during precision tuning either, so -finit-zero is kept separate.
    • -fp-model=: -ffp-compatibility may support -fp-model=fast in the future, so that its behavior can be adjusted based on -fp-model to fit the precise and fast modes of ICC respectively. Therefore, -fp-model is kept separate.
    • -mllvm -disable-sincos-opt: It is enabled by default if -fp-model=precise is set.
    • -enable-alt-precision-math-functions -enable-18-math-compatibility: These options generate calls to the KML precision interfaces, so the program must be linked against the KML.

    Floating-point precision option

    fp-model=precise

    ffp-model=precise by default

    ffp-model=fast

    ffp-model has the following three modes:

    • The precise mode (by default) is equivalent to setting -ffp-contract=fast -fno-rounding-math.
    • The strict mode is equivalent to setting -ftrapping-math -frounding-math -ffp-exception-behavior=strict.
    • The fast mode is equivalent to setting -menable-no-infs -menable-no-nans -menable-unsafe-fp-math -fno-signed-zeros -mreassociate -freciprocal-math -ffp-contract=fast -fno-rounding-math -ffast-math -ffinite-math-only.

    Math function

    The sincos function optimization

    None

    disable-sincos-opt

    Default

    BiSheng processes consecutive sin and cos calculations collectively, which causes precision differences.

    The min/max function optimization

    None

    faarch64-minmax-alt-precision

    Default

    Modify the optimization policy of the min/max function so that its calculation result is consistent with that of a non-Arm platform.

    recip instruction optimization

    None

    mllvm -aarch64-recip-alt-precision

    Default

    Use soft floating-point compensation so that the calculation result of the recip instruction is consistent with that of a non-Arm platform.

    rsqrt instruction optimization

    None

    mllvm -aarch64-rsqrt-alt-precision

    Default

    Use soft floating-point compensation so that the calculation result of the rsqrt instruction is consistent with that of a non-Arm platform.

    Reciprocal of square root optimization

    None

    mllvm -disable-recip-sqrt-opt=<true|false>

    Default

    In the fastmath scenario, optimize A = (C / sqrt(Y)); B = A * A to use fewer instructions in the calculation. The value true indicates that the optimization is disabled. The default value false indicates that the optimization is enabled.

    Fused multiply-add

    None

    ffma-combine-fdiv

    Default

    A common option used to optimize the expression a/b+c to fma(a, 1/b, c), which makes calculation results consistent with those on a non-Arm platform. This option takes effect only when -ffp-contract=fast is set.

    Fused multiply-add

    None

    ffma-reverse-associative

    Default

    A common option used to optimize the expression a*b+c*d to fma(a, b, c*d), which makes calculation results consistent with those on a non-Arm platform. This option takes effect only when -ffp-contract=fast is set.

    Law of commutativity and law of associativity permission

    None

    fassociative-math

    Default

    Allow floating-point operations to be reordered according to the commutative and associative laws.

    Division function optimization

    None

    freciprocal-math

    Default

    Allow conversion from division to multiplication by reciprocal.

    Optimization of loose inline math functions

    None

    frelaxed-math

    Default

    Use inline math functions.

    Rounding error

    fp-port

    fno-rounding-math

    frounding-math by default

    Specify whether to use rounding of the IEEE 754 standard. By default, rounding to the nearest value is enabled.

    CRC32 calculation optimization

    None

    mllvm -enable-gzipcrc32=false

    mllvm -enable-gzipcrc32=true by default

    Identify the CRC32 calculation logic in the code and use the built-in instructions of the processor to accelerate the calculation. The value true (by default) indicates that the optimization is enabled. The value false indicates that the optimization is disabled.

    INF optimization

    None

    ffinite-math-only

    Default

    Assume that there is no infinity or NaN. In the Clang version, the option is equivalent to -fno-honor-nans -fno-honor-infinities.

    Math function optimization

    None

    Default

    menable-unsafe-fp-math

    Allow unsafe floating-point math function optimization which may reduce precision.

    Math library

    The csqrtf and zsqrtf functions fitting

    IMF by default

    faarch64-pow-alt-precision=18/21

    libm/pgmath by default

    The square root results of complex numbers are inconsistent on the two platforms.

    The pow function fitting

    Default IMF 80-bit precision version powr8i4

    faarch64-pow-alt-precision=18/21

    libm by default

    The compiler decides how to optimize common exponentiation expressions such as a**4 and 2**a. As a result, ICC and the BiSheng compiler differ significantly on such expressions. To solve this problem, the -faarch64-pow-alt-precision=18/21 option was developed on the compiler side to fit ICC's optimization. This option makes the various optimized pow expressions consistent with ICC at optimization levels from O0 to Ofast. If the parameter is set to 18, the expression optimization is consistent with ICC 2018.1. If the parameter is set to 21, the expression optimization is consistent with ICC 2021.

    The log function enablement

    IMF by default

    Disabled by default.

    mllvm -enable-18-math-compatibility

    Replace math functions such as tgammaf, cbrt, log, and log10 with functions suffixed with _18 to control the precision of math functions (used together with the KML). This optimization option takes effect only at optimization level O1 or higher, and only when -mllvm -enable-alt-precision-math-functions is enabled.

    The sin/cos function precision enablement

    IMF by default

    enable-alt-precision-math-functions km_l9

    libm by default

    Different math libraries implement some math functions slightly differently, such as asindf, cosdf, cbrt, powr8i4, and exp2. Although few results differ and the differences are small (for example, only the last bit differs), the errors can be amplified by the iterative calculations of the application.

    WRF testing shows that an error in the last bit of the atan2f output is amplified in the subsequent WRF calculations. As a result, the accumulated precipitation in the 24-hour weather forecast result exceeds 50 mm (in actual weather, 50 mm of precipitation is rainstorm level).

    Implement the functions based on the ICC math library algorithm and verify them against data with sufficient coverage. This optimization option takes effect only when the optimization level is O1 or higher. The logic is to change the function names __mth_i_cosd, __mth_i_asind, and __pd_powi_1 to cosdf, asindf, and powr8i4 respectively.

    Floating-point control

    Immediate data precision bug fixing

    None

    Hx,124,0xc00000

    Not fixed by default

    The code contains some immediate data (literal constants). For a small number of these immediates, different compilers store different values in memory. Take 0.002362E12_8 as an example:

    With ICC, the value stored in memory (hexadecimal) is 41E1992860000000.

    With BiSheng, the value stored in memory (hexadecimal) is 41E1992840000000.

    With GCC, the value stored in memory (hexadecimal) is 41E1992840000000.

    Add this compilation option to solve this problem.

    FMA

    [no-]fma

    ffp-contract=off

    ffp-contract=on by default

    Specify whether to generate the fused multiply-add calculation.

    ftz fitting

    [no-]ftz

    clang: fdenormal-fp-math=[ieee|preserve-signs|positive-zero]

    flang-SSE-x86: Mflushz

    flang: -Mdaz

    Not required for processing by default

    The x86 and Kunpeng processors handle flush-to-zero (FTZ) differently:

    • They differ in the order of rounding and the FTZ check. x86 rounds before the check, while Kunpeng checks before rounding.
    • They define the FTZ boundary values differently. x86 complies with the IEEE standard, that is, subnormal numbers are not flushed by default. Kunpeng processes boundary values based on the following rules:
      • preserve-signs keeps the sign of a flushed subnormal number.
      • positive-zero/Mdaz flushes subnormal numbers to positive 0.

    Uninitialized array

    init=zero -init=arrays

    finit-zero

    Random initialization by default

    The Fortran language standard does not specify how uninitialized variables are handled, so initialization is left to the compiler. As a result, the initialization scope and mode are inconsistent across compilers. The BiSheng compiler handles these scenarios well. Currently, the -finit-zero option can be used to implement safe initialization.

    Sum of absolute values

    None

    mllvm -sad-pattern-recognition=false

    mllvm -sad-pattern-recognition=true

    Optimize the sum of absolute differences (sum += abs(a[i] - b[i])) to generate a simplified and efficient calculation sequence. The value true (by default) indicates that the optimization is enabled.

    Vectorization calculation

    None

    mllvm -aarch64-hadd-generation=false

    mllvm -aarch64-hadd-generation=true

    Use the Arm NEON instruction URHADD to vectorize the calculation (x[i] + y[i] + 1) >> 1 and generate better code. The value true (by default) indicates that the optimization is enabled.

    Reduction sequence

    None

    mllvm -instcombine-reorder-sum-of-reduce-add=false

    mllvm -instcombine-reorder-sum-of-reduce-add=true

    Change the sequence of reduction operations to generate better reduction code. The value true (by default) indicates that the optimization is enabled.

    Overflow processing of undefined shift behavior

    None

    foverflow-shift-alt-behavior

    fno-overflow-shift-alt-behavior by default

    For an undefined shift that exceeds the bit width of the integer data type, such as (int) a << 40, the BiSheng compiler folds the expression to an integer constant in advance. This prevents different optimizations from identifying the expression and evaluating it to different values. This option is disabled by default.

    Unified control

    None

    ffp-compatibility=17/18/21

    ffp-compatibility=17/18/21

    A common option, which is used to control all enabled options for ensuring calculation result consistency between the current platform and a non-Arm platform.

    Default precision of floating point

    None

    fdefault-double-8

    Single precision by default

    Floating-point numbers are single-precision by default. Enable this option to change them to double-precision.

    Floating-point conversion

    None

    freal-10-real-16

    freal-10-real-4

    freal-10-real-8

    Not required for conversion by default

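    Convert REAL(KIND=10) floating-point numbers to REAL(KIND=16), REAL(KIND=4), or REAL(KIND=8). The numbers in the option names are kind values (sizes in bytes), not bit counts.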

    Register optimization

    None

    modd-spreg

    mno-odd-spreg by default

    Enable or disable the use of odd-numbered single-precision floating-point registers.

    Signed zero optimization

    None

    fsigned-zeros

    fno-signed-zeros by default

    Allow optimizations to ignore the sign of zero.

    NAN optimization

    None

    fhonor-nans

    fno-honor-nans by default

    Assume that NaN values do not occur, so that optimizations can ignore them. The Clang version makes the same assumption.

    INF optimization

    None

    fhonor-infinities

    fno-honor-infinities by default

    Assume that there is no infinite value.
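    The immediate data row of Table 2 quotes two different stored values for the literal 0.002362E12_8. The following sketch examines those numbers with Python's struct module. It suggests that the literal is exactly representable in double precision and lies exactly halfway between two adjacent single-precision values, and that those two neighbors, widened to double, are the two hexadecimal values quoted in the table. This is an observation about the numbers only. It does not establish how ICC, BiSheng, or GCC actually parse the literal, and the helper name double_hex is hypothetical.

```python
import struct


def double_hex(x):
    """Return the IEEE 754 double-precision bit pattern of x as 16 hex digits."""
    return struct.pack(">d", x).hex().upper()


# The literal from Table 2, written as a Python float (a double).
literal = 0.002362e12

# Round to single precision (round-to-nearest, ties-to-even) and take its bit pattern.
f32_bits = struct.unpack(">I", struct.pack(">f", literal))[0]

# The ties-to-even result and the next single-precision value above it.
round_even = struct.unpack(">f", struct.pack(">I", f32_bits))[0]
round_up = struct.unpack(">f", struct.pack(">I", f32_bits + 1))[0]

print("literal as double:       ", double_hex(literal))
print("single, ties-to-even:    ", double_hex(round_even))
print("single, next value above:", double_hex(round_up))
print("distance below / above:  ", literal - round_even, round_up - literal)
```

    Under these assumptions, the ties-to-even neighbor matches the BiSheng and GCC value in the table and the next value up matches the ICC value. The two distances are equal, which is consistent with a rounding tie being resolved in different directions.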