Tuning Strategy
The logic for optimizing the compiler precision is as follows:
- Set the compilation option on both the Kunpeng and x86 platforms to -O1 and use the same open source SLEEF math library. With this configuration, check whether the Kunpeng precision matches the x86 precision, or whether the precision improves compared with -O3. Because ICC has a loop combining bug, you may need to use the -O1 option for specific files.
Figure 1 ICC loop combining bug

According to the code semantics, this loop calculates SUMDP by adding several numbers in array-index order. Testing shows that ICC produces different SUMDP results with -O3 than with -O0 or -O1.
This example compares the binary results of sequentially adding six numbers (the correct result, calculated manually, is 0b01000110010110001100000000110100):
- When ICC uses the -fp-model precise -O3 compilation configuration, the result is 0b01000110010110001100000000110011.
- When ICC uses the -fp-model precise -O0 or -fp-model precise -O1 compilation configuration, the result is 0b01000110010110001100000000110100.
- When BiSheng uses the -O3 or -O0 compilation configuration, the result is 0b01000110010110001100000000110100.
Therefore, it can be inferred that ICC loses precision when it optimizes this loop under the -fp-model precise -O3 compilation configuration. Because this optimization conflicts with -fp-model precise, it is an ICC bug.
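Floating-point addition is not associative, so any optimization that changes the order in which a loop adds its terms can change the last bits of the sum. The sketch below illustrates that effect only. It is not the WRF loop from Figure 1, and the six input values and the two-lane regrouping are assumptions chosen for illustration, because the source does not say how ICC regroups the additions. Single-precision addition is emulated by rounding every partial sum to 32 bits.

```python
import struct


def f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def bits32(x):
    """Return the 32-bit IEEE 754 pattern of a single-precision value."""
    return format(struct.unpack("<I", struct.pack("<f", x))[0], "032b")


def sum_in_index_order(values):
    """Add the values strictly in array-index order, rounding after each add."""
    total = 0.0
    for v in values:
        total = f32(total + v)
    return total


def sum_two_lanes(values):
    """Add even- and odd-indexed values separately, then combine the lanes.

    This mimics one way a vectorized reduction might regroup the additions.
    """
    lane0 = sum_in_index_order(values[0::2])
    lane1 = sum_in_index_order(values[1::2])
    return f32(lane0 + lane1)


# Hypothetical inputs: one large value followed by five small ones.
values = [16777216.0, 1.0, 1.0, 1.0, 1.0, 1.0]

in_order = sum_in_index_order(values)
regrouped = sum_two_lanes(values)
print("index order:", bits32(in_order))
print("two lanes:  ", bits32(regrouped))
print("identical:  ", in_order == regrouped)
```

With these inputs the two orders produce different bit patterns, which is the kind of discrepancy a precise floating-point model is expected to rule out.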
In the WRF source code, the following files contain code that triggers the bug:
- physics/module_cu_bmj.F (corresponding physical parameter: cu_physics=2)
- physics/module_cu_gf_deep.F (corresponding physical parameter: cu_physics=3)
- physics/module_ra_goddard.F (corresponding physical parameter: ra_lw_physics=5/ra_sw_physics=5)
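One way to apply -O1 only to the affected files is to choose the optimization level per source file in the build script. The sketch below is a hypothetical helper, not part of the WRF build system: the function name, the default level, and the second example path are assumptions, and only the three file paths come from the list above.

```python
from pathlib import PurePosixPath

# WRF source files listed above as affected by the ICC loop combining bug.
ICC_O1_FILES = {
    "physics/module_cu_bmj.F",
    "physics/module_cu_gf_deep.F",
    "physics/module_ra_goddard.F",
}


def icc_opt_level(source_path, default="-O3"):
    """Return -O1 for the affected files and the default level otherwise."""
    path = PurePosixPath(source_path)
    # Compare the last two path components so that "./physics/x.F" and
    # "/src/WRF/physics/x.F" are both recognized.
    key = "/".join(path.parts[-2:])
    return "-O1" if key in ICC_O1_FILES else default


# The second path is only an example of a file that is not on the list.
for src in ("physics/module_cu_bmj.F", "dyn_em/solve_em.F"):
    print(src, icc_opt_level(src))
```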
- After ruling out differences at -O1, select compilation options from Table 1 for tuning based on the preset precision benchmark. Of the following policies, the optimal precision mode is the most commonly used.
Table 1 Precision tuning policy

| Precision Mode | x86 Option | BiSheng Option | Best Result |
| --- | --- | --- | --- |
| Optimal performance mode | -O3 | -O3 -ffp-model=fast | Approximate |
| Optimal precision mode | -O3 -fp-model=precise -no-ftz -init=zero -init=arrays | -O3 -faarch64-pow-alt-precision=21 -mllvm -enable-alt-precision-math-functions -lkm_l9 -Hx,124,0xc00000 -ffp-contract=off -finit-zero -mllvm -disable-sincos-opt -Mflushz, or -O3 -ffp-compatibility=18/21 -finit-zero -lkm_l9 | Consistent |
| Precision and performance compromise mode | Remove some of the precision options: -no-ftz, -init=zero, and -init=arrays. | Remove some of the precision options: -ffp-contract=off, -finit-zero, and -disable-sincos-opt. | Consistent |
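Table 1 lists a long and a short BiSheng option set for the optimal precision mode. The sketch below records the expansions that this document states for -ffp-compatibility=21 and -ffp-compatibility=18 (each combined with -fp-model=precise), so that the short form can be inspected as individual flags. The dictionary and function names are hypothetical, and the flag strings are copied as written in this document rather than checked against a compiler.

```python
# Expansions as stated in this document for -ffp-compatibility=<n> used
# together with -fp-model=precise. Not verified against a compiler.
FFP_COMPATIBILITY = {
    21: [
        "-ffp-contract=off",
        "-faarch64-pow-alt-precision=21",
        "-Hx,124,0xc00000",
        "-mllvm",
        "-enable-alt-precision-math-functions",
    ],
    18: [
        "-ffp-contract=off",
        "-faarch64-pow-alt-precision=21",
        "-Hx,124,0x800000",
        "-mllvm",
        "-enable-alt-precision-math-functions",
        "-enable-18-math-compatibility",
    ],
}


def expand_ffp_compatibility(version):
    """Return a copy of the documented flag list for the given ICC version."""
    if version not in FFP_COMPATIBILITY:
        raise ValueError(f"no documented expansion for {version!r}")
    return list(FFP_COMPATIBILITY[version])


def differing_flags(a, b):
    """Return the flags that appear in only one of the two expansions."""
    only_one = set(expand_ffp_compatibility(a)) ^ set(expand_ffp_compatibility(b))
    return sorted(only_one)


print(" ".join(expand_ffp_compatibility(21)))
print(differing_flags(18, 21))
```

Going by these lists, the two settings differ only in the -Hx,124 value and in the additional 2018 math compatibility flag.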
Table 2 Main compilation options affecting precision
Category
Precision Compilation Option
x86 Option
BiSheng Option
BiSheng Tuning Option
Impact on Precision
Compiler
Compiler version
2018/2021
2.1-3.1
2.1-3.1
Different compiler versions may produce inconsistent results. In fast mode, different Intel compiler versions handle floating-point precision differently, which causes key indicators to fluctuate, and the Kunpeng results cannot be fitted to them either. Therefore, use the compiler version specified by the customer. For BiSheng, a later version is preferred.
Optimization option
Optimization option
O0-O3-Ofast
O0-O2
O3-Ofast
The O0 option disables all optimizations, which greatly reduces performance, so you are advised not to use it. Optimize the precision based on the O3 option. When the ICC loop combining bug occurs, use the -O1 option only for the specific files affected.
Floating-point model
Calculation reordering
None
Disabled by default.
funsafe-math-optimizations
The following reordering compilation options can be enabled:
-fno-signed-zeros
-fno-trapping-math
-fassociative-math
-freciprocal-math
Fast calculation
ffast-math
Disabled by default.
ffast-math
The following options are enabled by default:
-fno-honor-infinities
-fno-honor-nans
-fno-math-errno
-ffinite-math-only
-fassociative-math
-freciprocal-math
-fno-signed-zeros
-fno-trapping-math
-ffp-contract=fast
Unified precision option
None
ffp-compatibility
Disabled by default.
Set -ffp-compatibility=18/21 to enable this option. If the related precision options are not set explicitly, this option enables the following options.
The following lists the precision options of BiSheng compared with Intel 2021:
Setting -ffp-compatibility=21 -fp-model=precise is equivalent to setting -ffp-contract=off -faarch64-pow-alt-precision=21 -Hx,124,0xc00000 -mllvm -enable-alt-precision-math-functions.
The following lists the precision options of BiSheng compared with Intel 2018:
Setting -ffp-compatibility=18 -fp-model=precise is equivalent to setting -ffp-contract=off -faarch64-pow-alt-precision=21 -Hx,124,0x800000 -mllvm -enable-alt-precision-math-functions -enable-18-math-compatibility.
The following precision options are not included in -ffp-compatibility, for the reasons given:
- -finit-zero: Initializes the uninitialized variables in Fortran to zero. The Intel compiler controls this behavior with a separate option, and some applications cannot run properly when that option is added to the Intel compiler. In this case, BiSheng cannot enable the corresponding option during precision tuning. Therefore, -finit-zero is not included in -ffp-compatibility.
- -fp-model=: -ffp-compatibility may support -fp-model=fast in the future, so that its behavior can be adjusted based on -fp-model to fit the precise and fast modes of ICC, respectively. Therefore, -fp-model= is not included in -ffp-compatibility.
- -mllvm -disable-sincos-opt: It is enabled by default if -fp-model=precise is set.
- -enable-alt-precision-math-functions -enable-18-math-compatibility: These options generate calls to KML precision interfaces, so the application must be linked with the KML.
Floating-point precision option
fp-model=precise
ffp-model=precise by default
ffp-model=fast
ffp-model has the following three modes:
- The precise mode (by default) is equivalent to setting
-ffp-contract=fast -fno-rounding-math.
- The strict mode is equivalent to setting
-ftrapping-math -frounding-math -ffp-exception-behavior=strict.
- The fast mode is equivalent to setting
-menable-no-infs -menable-no-nans -menable-unsafe-fp-math -fno-signed-zeros -mreassociate
-freciprocal-math -ffp-contract=fast -fno-rounding-math -ffast-math -ffinite-math-only.
Math function
The sincos function optimization
None
disable-sincos-opt
Default
BiSheng processes consecutive sin and cos calculations collectively, which causes precision differences.
The min/max function optimization
None
faarch64-minmax-alt-precision
Default
Modify the optimization policy of the min/max function so that the calculation result of the min/max function is consistent with that of a non-Arm platform.
recip instruction optimization
None
mllvm -aarch64-recip-alt-precision
Default
Use soft floating-point compensation so that the calculation result of the recip instruction is consistent with that of a non-Arm platform.
rsqrt instruction optimization
None
mllvm -aarch64-rsqrt-alt-precision
Default
Use soft floating-point compensation so that the calculation result of the rsqrt instruction is consistent with that of a non-Arm platform.
Reciprocal of square root optimization
None
mllvm -disable-recip-sqrt-opt=<true|false>
Default
In the fast-math scenario, optimize A = (C / sqrt(Y)); B = A * A so that the calculation uses fewer instructions. The value true indicates that the optimization is disabled. The default value false indicates that the optimization is enabled.
Fused multiply-add
None
ffma-combine-fdiv
Default
A common option used to optimize the expression a/b+c to fma(a, 1/b, c), which makes calculation results consistent with those on a non-Arm platform. This option takes effect only when -ffp-contract=fast is set.
Fused multiply-add
None
ffma-reverse-associative
Default
A common option used to optimize the expression a*b+c*d to fma(a, b, c*d), which makes calculation results consistent with those on a non-Arm platform. This option takes effect only when -ffp-contract=fast is set.
Law of commutativity and law of associativity permission
None
fassociative-math
Default
Allow floating-point operations to be reordered according to the commutative and associative laws.
Division function optimization
None
freciprocal-math
Default
Allow conversion from division to multiplication by reciprocal.
Optimization of loose inline math functions
None
frelaxed-math
Default
Use inline math functions.
Rounding error
fp-port
fno-rounding-math
frounding-math by default
Specify whether to use rounding of the IEEE 754 standard. By default, rounding to the nearest value is enabled.
CRC32 calculation optimization
None
mllvm -enable-gzipcrc32=false
mllvm -enable-gzipcrc32=true by default
Identify the CRC32 calculation logic in the code and use the built-in instructions of the processor to accelerate the calculation. The value true (by default) indicates that the optimization is enabled. The value false indicates that the optimization is disabled.
INF optimization
None
ffinite-math-only
Default
Assume that there is no infinity or NaN. In the Clang version, the option is equivalent to -fno-honor-nans -fno-honor-infinities.
Math function optimization
None
Default
menable-unsafe-fp-math
Allow unsafe floating-point math function optimization which may reduce precision.
Math library
The csqrtf and zsqrtf functions fitting
IMF by default
faarch64-pow-alt-precision=18/21
libm/pgmath by default
The square root results of complex numbers are inconsistent on the two platforms.
The pow function fitting
Default IMF 80-bit precision version powr8i4
faarch64-pow-alt-precision=18/21
libm by default
How common exponent expressions such as a**4 and 2**a are optimized is decided by the compiler. As a result, ICC and the BiSheng compiler differ significantly in how they evaluate such expressions. To solve this problem, the -faarch64-pow-alt-precision=18/21 option was developed on the compiler side to fit ICC's optimization. This option makes the various optimized pow expressions consistent with ICC at optimization levels from O0 to Ofast. If the parameter is set to 18, the expression optimization is consistent with ICC 2018.1. If the parameter is set to 21, the expression optimization is consistent with ICC 2021.
The log function enablement
IMF by default
Disabled by default.
mllvm -enable-18-math-compatibility
Replace math functions such as tgammaf, cbrt, log, and log10 with functions suffixed with _18 to control the precision of math functions (used together with the KML). This optimization option takes effect only at optimization level O1 or higher, and only when -mllvm -enable-alt-precision-math-functions is enabled.
The sin/cos function precision enablement
IMF by default
enable-alt-precision-math-functions km_l9
libm by default
Different math libraries implement some math functions, such as asindf, cosdf, cbrt, powr8i4, and exp2, slightly differently. Although the differences are few and small (for example, only the last bit differs), the errors can be amplified by the iterative calculations of the application.
WRF testing shows that an error in the last bit of the atan2f output is amplified in the subsequent WRF calculations. As a result, the accumulated precipitation in the 24-hour weather forecast result exceeds 50 mm (in actual weather, 50 mm of precipitation is the rainstorm level).
The functions are implemented based on the ICC math library algorithm and verified with sufficient data coverage. This optimization option takes effect only when the optimization level is O1 or higher. The logic is to change the function names __mth_i_cosd, __mth_i_asind, and __pd_powi_1 to cosdf, asindf, and powr8i4, respectively.
Floating-point control
Immediate data precision bug fixing
None
124,0xc00000
Not fixed by default
The code contains some immediate data. A small number of these immediates are read into memory as different values by different compilers. Take 0.002362E12_8 as an example:
When read by ICC to the memory, the value (hexadecimal) is 41E1992860000000.
When read by BiSheng to the memory, the value (hexadecimal) is 41E1992840000000.
When read by GCC to the memory, the value (hexadecimal) is 41E1992840000000.
Add this compilation option to solve this problem.
FMA
[no-]fma
ffp-contract=off
ffp-contract=on by default
Specify whether to generate the fused multiply-add calculation.
ftz fitting
[no-]ftz
clang: fdenormal-fp-math=[ieee|preserve-sign|positive-zero]
flang-SSE-x86: Mflushz
flang: -Mdaz
Not required for processing by default
The x86 and Kunpeng processors handle flush-to-zero (FTZ) differently:
- They perform rounding and the FTZ judgment in a different order. x86 performs rounding before the judgment, while Kunpeng performs the judgment before rounding.
- They define the FTZ boundary values differently. x86 complies with the IEEE standard, that is, subnormal numbers are not processed by default. Kunpeng, however, processes boundary values based on the following rules:
- preserve-sign retains the sign of subnormal numbers.
- positive-zero/Mdaz converts subnormal numbers to positive 0.
Uninitialized array
init=zero -init=arrays
finit-zero
Random initialization by default
How uninitialized variables are initialized is left to the compiler, because the Fortran language standard does not specify it. As a result, the initialization scope and mode are not consistent across compilers. The BiSheng compiler provides good support for these scenarios. Currently, the -finit-zero option can be used to implement secure initialization.
Sum of absolute values
None
mllvm -sad-pattern-recognition=false
mllvm -sad-pattern-recognition=true
Optimize the sum of absolute differences calculation (sum += abs(a[i] - b[i])) to generate a simplified and efficient calculation sequence. The value true (by default) indicates that the optimization is enabled.
Vectorization calculation
None
mllvm -aarch64-hadd-generation=false
mllvm -aarch64-hadd-generation=true
Use an Arm NEON instruction URHADD to complete the vectorization calculation ((x[i] + y[i] + 1) >> 1), to generate better code. The value true (by default) indicates that the optimization is enabled.
Reduction sequence
None
mllvm -instcombine-reorder-sum-of-reduce-add=false
mllvm -instcombine-reorder-sum-of-reduce-add=true
Change the sequence of reduction operations to generate better reduction code. The value true (by default) indicates that the optimization is enabled.
Overflow processing of undefined shift behavior
None
foverflow-shift-alt-behavior
fno-overflow-shift-alt-behavior by default
For an undefined shift that exceeds the bit width of the integer data type, such as (int) a << 40, the BiSheng compiler folds the expression to an integer constant in advance so that different optimizations do not evaluate it to different values. This option is disabled by default.
Unified control
None
ffp-compatibility=17/18/21
ffp-compatibility=17/18/21
A common option, which is used to control all enabled options for ensuring calculation result consistency between the current platform and a non-Arm platform.
Default precision of floating point
None
fdefault-double-8
Single precision by default
The floating point is single-precision by default. Enable this option to change the floating point to double-precision.
Floating-point conversion
None
freal-10-real-16
freal-10-real-4
freal-10-real-8
Not required for conversion by default
Convert floating-point numbers of kind 10 (REAL(KIND=10)) to kind 16, 4, or 8.
Register optimization
None
modd-spreg
mno-odd-spreg by default
Enable or disable the use of odd-numbered single-precision floating-point registers.
Signed zero optimization
None
fsigned-zeros
fno-signed-zeros by default
Allow optimizations that ignore the sign of zero.
NAN optimization
None
fhonor-nans
fno-honor-nans by default
Assume that all NaNs have no impact. The Clang version also ignores the NaN that has no impact.
INF optimization
None
fhonor-infinities
fno-honor-infinities by default
Assume that there is no infinite value.
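The immediate data example in Table 2 can be examined by decoding the two bit patterns. The sketch below uses only the hexadecimal values quoted in the table, and the helper names are hypothetical. It checks that both values are exactly representable in single precision and that the decimal literal lies exactly halfway between them. This would be consistent with the compilers resolving a rounding tie in different directions, although the source does not state the cause.

```python
import struct


def double_from_hex(hex_bits):
    """Decode a 64-bit IEEE 754 pattern given as 16 hexadecimal digits."""
    return struct.unpack(">d", bytes.fromhex(hex_bits))[0]


def to_f32(x):
    """Round a double to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


literal = 0.002362e12                           # value of 0.002362E12_8
icc = double_from_hex("41E1992860000000")       # as read by ICC
bisheng = double_from_hex("41E1992840000000")   # as read by BiSheng and GCC

print("ICC value:     ", icc)
print("BiSheng value: ", bisheng)
print("distance above literal:", icc - literal)
print("distance below literal:", literal - bisheng)
print("both fit in single precision:",
      to_f32(icc) == icc and to_f32(bisheng) == bisheng)
print("round-to-nearest-even picks the BiSheng value:",
      to_f32(literal) == bisheng)
```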