Tuning Strategy

The logic for optimizing the compiler precision is as follows:

First, compile with -O1 on both the Kunpeng and x86 platforms and use the same open source SLEEF math library on both. With this baseline, check whether the Kunpeng precision matches that of x86, or at least whether it improves relative to the -O3 build. Because ICC has a loop-combining bug, specific files may need to be compiled with the -O1 option.
Figure 1 ICC loop combining bug

According to the code semantics, this loop computes SUMDP by adding several numbers in array-index order. Testing shows that the SUMDP result differs between the ICC -O3 and -O0/-O1 compilation configurations.

The example compares, in binary, the sums obtained by adding six numbers sequentially (the correct result by manual calculation is 0b01000110010110001100000000110100):
- With the ICC -fp-model precise -O3 compilation configuration, the result is 0b01000110010110001100000000110011.
- With the ICC -fp-model precise -O0 or -fp-model precise -O1 compilation configuration, the result is 0b01000110010110001100000000110100.
- With the BiSheng -O3 or -O0 compilation configuration, the result is 0b01000110010110001100000000110100.
Therefore, it can be inferred that ICC loses precision when it optimizes the loop under the -fp-model precise -O3 compilation configuration. This optimization conflicts with -fp-model precise, so it is a bug in ICC.
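
To make the size of this discrepancy concrete, the two bit patterns quoted above can be decoded as IEEE 754 single-precision values. The following is a minimal Python sketch; the helper name `bits_to_float32` and the variable names are illustrative assumptions, and only the bit patterns come from the text.

```python
import struct


def bits_to_float32(bits: str) -> float:
    """Interpret a 32-character binary string as an IEEE 754 single-precision value."""
    return struct.unpack(">f", int(bits, 2).to_bytes(4, "big"))[0]


# Bit patterns quoted in the text above.
icc_o3 = "01000110010110001100000000110011"     # ICC, -fp-model precise -O3
reference = "01000110010110001100000000110100"  # ICC -O0/-O1, BiSheng, manual calculation

value_o3 = bits_to_float32(icc_o3)
value_ref = bits_to_float32(reference)

# For two positive floats, the difference of their integer encodings is
# their distance in units in the last place (ULP).
ulp_distance = int(reference, 2) - int(icc_o3, 2)

print(value_o3, value_ref)
print("absolute difference:", value_ref - value_o3)
print("ULP distance:", ulp_distance)
```

The two results are adjacent single-precision values, that is, they differ only in the last bit of the significand.
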
In the WRF source code, the following files are affected by this bug:
- physics/module_cu_bmj.F (corresponding physical parameter: cu_physics=2)
- physics/module_cu_gf_deep.F (corresponding physical parameter: cu_physics=3)
- physics/module_ra_goddard.F (corresponding physical parameter: ra_lw_physics=5/ra_sw_physics=5)
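
The underlying reason a restructured loop can change a sum is that floating-point addition is not associative. The following minimal Python sketch illustrates this with made-up values (not the WRF SUMDP data) and a hypothetical two-lane accumulation. It emulates single precision with the standard `struct` module and does not claim to reproduce how ICC actually transforms the loop.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_in_order(values):
    """Add the values one by one in index order, rounding to float32 after each step."""
    total = 0.0
    for v in values:
        total = f32(total + f32(v))
    return total


def sum_two_lanes(values):
    """Accumulate even and odd indices separately, then combine the partial sums.

    This mimics one way a restructured or vectorized loop may reorder a reduction.
    """
    even = 0.0
    odd = 0.0
    for i, v in enumerate(values):
        if i % 2 == 0:
            even = f32(even + f32(v))
        else:
            odd = f32(odd + f32(v))
    return f32(even + odd)


# Illustrative values only: a large term cancels, small terms remain.
values = [1.0e8, 1.0, -1.0e8, 1.0]

print(sum_in_order(values))   # 1.0: the first 1.0 is absorbed by 1.0e8
print(sum_two_lanes(values))  # 2.0: both 1.0 terms survive
```

Both functions add exactly the same four numbers, yet the results differ because each intermediate sum is rounded to single precision.
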

After differences at the -O1 level have been ruled out, select compilation options from Table 1 for tuning according to the preset precision benchmark. Of the following policies, the optimal precision mode is the most commonly used.

**Table 1** Precision tuning policy
Precision Mode	x86 Option	BiSheng Option	Best Result
Optimal performance mode	-O3	-O3 -ffp-model=fast	Approximate
Optimal precision mode	-O3 -fp-model=precise -no-ftz -init=zero -init=arrays	-O3 -faarch64-pow-alt-precision=21 -mllvm -enable-alt-precision-math-functions -lkm_l9 -Hx,124,0xc00000 -ffp-contract=off -finit-zero -mllvm -disable-sincos-opt -MflushZ or -O3 -ffp-compatibility=18/21 -finit-zero -lkm_l9	Consistent
Precision and performance compromise mode	Remove some of the precision options: -no-ftz, -init=zero, and -init=arrays.	Remove some of the precision options: -ffp-contract=off, -finit-zero, and -disable-sincos-opt.	Consistent
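
When checking a build against the precision benchmark, it can help to classify output data in the terms used in the Best Result column. The following Python sketch is one possible way to do that. The definitions are assumptions, not taken from this document: "consistent" is treated as bit-identical single-precision output, "approximate" as agreement within a relative tolerance, and the function name `classify` and the default tolerance are hypothetical.

```python
import math
import struct


def float32_bits(x: float) -> int:
    """Return the 32-bit integer encoding of x rounded to single precision."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def classify(reference, candidate, rel_tol=1e-5):
    """Compare two output sequences.

    Returns "consistent" if every pair is bit-identical in single precision,
    "approximate" if every pair agrees within rel_tol, and "different" otherwise.
    """
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same length")
    pairs = list(zip(reference, candidate))
    if all(float32_bits(r) == float32_bits(c) for r, c in pairs):
        return "consistent"
    if all(math.isclose(r, c, rel_tol=rel_tol) for r, c in pairs):
        return "approximate"
    return "different"


# Illustrative data: the first element reuses the two values from the ICC example.
x86_out = [13872.05078125, 0.5, -2.25]
kunpeng_same = [13872.05078125, 0.5, -2.25]
kunpeng_close = [13872.0498046875, 0.5, -2.25]

print(classify(x86_out, kunpeng_same))
print(classify(x86_out, kunpeng_close))
```

In practice the tolerance and the fields compared would come from the precision benchmark agreed for the application.
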

**Table 2** Main compilation options affecting precision
Category	Precision Compilation Option	x86 Option	BiSheng Option	BiSheng Tuning Option	Impact on Precision
Compiler	Compiler version	2018/2021	2.1-3.1	2.1-3.1	Different compiler versions may produce inconsistent results. In fast mode, different Intel compiler versions handle floating-point precision differently, which causes key indicators to fluctuate, and the Kunpeng results cannot be fitted to them either. Therefore, use the compiler version specified by the customer. For BiSheng, a later version is preferred.
Optimization option	Optimization option	O0-O3-Ofast	O0-O2	O3-Ofast	The O0 option disables all optimizations, which greatly reduces performance, so you are advised not to use it. Tune the precision on the basis of the O3 option. When the ICC loop-combining bug occurs, use the -O1 option only for the affected files.
Floating-point model	Calculation reordering	None	Disabled by default.	funsafe-math-optimizations	The following reordering compilation options can be enabled: -fno-signed-zeros -fno-trapping-math -fassociative-math -freciprocal-math
	Fast calculation	ffast-math	Disabled by default.	ffast-math	The following options are enabled by default: -fno-honor-infinities -fno-honor-nans -fno-math-errno -ffinite-math-only -fassociative-math -freciprocal-math -fno-signed-zeros -fno-trapping-math -ffp-contract=fast
	Unified precision option	None	ffp-compatibility	Disabled by default.	Set -ffp-compatibility=18/21 to enable this option. If the related precision options are not set explicitly, the following equivalences apply. Setting -ffp-compatibility=21 -fp-model=precise is equivalent to setting -ffp-contract=off -faarch64-pow-alt-precision=21 -Hx,124,0xc00000 -mllvm -enable-alt-precision-math-functions. For the BiSheng precision options that correspond to Intel 2018, setting -ffp-compatibility=18 -fp-model=precise is equivalent to setting -ffp-contract=off -faarch64-pow-alt-precision=21 -Hx,124,0x800000 -mllvm -enable-alt-precision-math-functions -enable-18-math-compatibility. The following precision options are not included in -ffp-compatibility, for the reasons given. -finit-zero: Initializes uninitialized Fortran variables to zero. The Intel compiler controls this with an independent option, and some applications cannot run properly after that option is added with the Intel compiler. In that case, BiSheng cannot enable the corresponding option during precision tuning. -fp-model=: -ffp-compatibility may support -fp-model=fast in the future, so that its behavior can be adjusted based on -fp-model to fit the precise and fast modes of ICC, respectively. -mllvm -disable-sincos-opt: It is enabled by default if -fp-model=precise is set. -enable-alt-precision-math-functions -enable-18-math-compatibility: These options generate calls to KML precision interfaces, which need to be bound to the KML.
	Floating-point precision option	fp-model=precise	ffp-model=precise by default	ffp-model=fast	ffp-model has the following three modes: The precise mode (by default) is equivalent to setting -ffp-contract=fast -fno-rounding-math. The strict mode is equivalent to setting -ftrapping-math -frounding-math -ffp-exception-behavior=strict. The fast mode is equivalent to setting -menable-no-infs -menable-no-nans -menable-unsafe-fp-math -fno-signed-zeros -mreassociate -freciprocal-math -ffp-contract=fast -fno-rounding-math -ffast-math -ffinite-math-only.
Math function	The sincos function optimization	None	disable-sincos-opt	Default	BiSheng processes consecutive sin and cos calculations collectively, which causes precision differences.
	The min/max function optimization	None	faarch64-minmax-alt-precision	Default	Modify the tuning policy of the min/max function so that the calculation result of the min/max function is consistent with that of a non-Arm platform.
	recip instruction optimization	None	mllvm -aarch64-recip-alt-precision	Default	Use soft floating-point compensation so that the calculation result of the recip instruction is consistent with that of a non-Arm platform.
	rsqrt instruction optimization	None	mllvm -aarch64-rsqrt-alt-precision	Default	Use soft floating-point compensation so that the calculation result of the rsqrt instruction is consistent with that of a non-Arm platform.
	Reciprocal of square root optimization	None	mllvm -disable-recip-sqrt-opt=<true\|false>	Default	In the fastmath scenario, optimize A = (C / sqrt(Y)); B = A * A to use fewer instructions in the calculation. The value true indicates that the optimization is disabled. The default value false indicates that the optimization is enabled.
	Fused multiply-add	None	ffma-combine-fdiv	Default	A common option used to optimize the expression a/b+c to fma(a, 1/b, c), which makes calculation results consistent with those on a non-Arm platform. This option takes effect only when -ffp-contract=fast is set.
	Fused multiply-add	None	ffma-reverse-associative	Default	A common option used to optimize the expression a*b+c*d to fma(a, b, c*d), which makes calculation results consistent with those on a non-Arm platform. This option takes effect only when -ffp-contract=fast is set.
	Commutative and associative law permission	None	fassociative-math	Default	Allow operations to be reordered according to the commutative and associative laws.
	Division function optimization	None	freciprocal-math	Default	Allow conversion from division to multiplication by reciprocal.
	Optimization of loose inline math functions	None	frelaxed-math	Default	Use inline math functions.
	Rounding error	fp-port	fno-rounding-math	frounding-math by default	Specify whether to use rounding of the IEEE 754 standard. By default, rounding to the nearest value is enabled.
	CRC32 calculation optimization	None	mllvm -enable-gzipcrc32=false	mllvm -enable-gzipcrc32=true by default	Identify the CRC32 calculation logic in the code and use the built-in instructions of the processor to accelerate the calculation. The value true (by default) indicates that the optimization is enabled. The value false indicates that the optimization is disabled.
	INF optimization	None	ffinite-math-only	Default	Assume that there is no infinity or NaN. In the Clang version, the option is equivalent to -fno-honor-nans -fno-honor-infinities.
	Math function optimization	None	Default	menable-unsafe-fp-math	Allow unsafe floating-point math function optimization which may reduce precision.
Math library	The csqrtf and zsqrtf functions fitting	IMF by default	faarch64-pow-alt-precision=18/21	libm/pgmath by default	The square root results of complex numbers are inconsistent on the two platforms.
	The pow function fitting	Default IMF 80-bit precision version powr8i4	faarch64-pow-alt-precision=18/21	libm by default	How common exponent expressions, such as a^4 and 2^a, are optimized is determined by the compiler. As a result, ICC and the BiSheng compiler differ significantly in such expressions. To solve this problem, the -faarch64-pow-alt-precision=18/21 option was developed on the compiler side to fit ICC's optimization. This option makes the various pow expression optimizations consistent with ICC at optimization levels from O0 to Ofast. If the parameter is set to 18, the expression optimization is consistent with ICC 2018.1. If the parameter is set to 21, the expression optimization is consistent with ICC 2021.
	The log function enablement	IMF by default	Disabled by default.	mllvm -enable-18-math-compatibility	Replace math functions such as tgammaf, cbrt, log, and log10 with functions suffixed with _18 to control the precision of math functions (used together with the KML). This optimization option takes effect only at optimization level O1 or higher and only when -mllvm -enable-alt-precision-math-functions is enabled.
	The sin/cos function precision enablement	IMF by default	enable-alt-precision-math-functions km_l9	libm by default	Different math libraries implement some math functions, such as asindf, cosdf, cbrt, powr8i4, and exp2, slightly differently. Although the differences are few and small (for example, only the last bit differs), the errors can be amplified in the iterative calculations of an application. WRF testing shows that an error in the last bit of the atan2f output is amplified in the subsequent WRF calculations. As a result, the accumulated precipitation in the 24-hour weather forecast result exceeds 50 mm (in actual weather, 50 mm of precipitation is the rainstorm level). The solution is to implement the functions based on the ICC math library algorithm and verify them with sufficient data coverage. This optimization option takes effect only when the optimization level is O1 or higher. The logic is to change the function names __mth_i_cosd, __mth_i_asind, and __pd_powi_1 to cosdf, asindf, and powr8i4, respectively.
Floating-point control	Immediate data precision bug fixing	None	124,0xc00000	Not fixed by default	The code contains immediate values. For a small number of them, different compilers store inconsistent values in memory. Take 0.002362E12_8 as an example: ICC stores the value (hexadecimal) 41E1992860000000, BiSheng stores 41E1992840000000, and GCC stores 41E1992840000000. Add this compilation option to solve this problem.
	FMA	[no-]fma	ffp-contract=off	ffp-contract=on by default	Specify whether to generate the fused multiply-add calculation.
	ftz fitting	[no-]ftz	clang: fdenormal-fp-math=[ieee\|preserve-signs\|positive-zero] flang-SSE-x86: Mflushz flang: -Mdaz	Not required for processing by default	The x86 and Kunpeng processors handle flush-to-zero (FTZ) differently in two respects. First, they differ in the order of rounding and the FTZ check: x86 rounds before the check, while Kunpeng checks before rounding. Second, they define the FTZ boundary values differently: x86 complies with the IEEE standard, that is, subnormal numbers are not processed by default, whereas Kunpeng processes boundary values based on the following rules: preserve-signs retains the sign of subnormal numbers, and positive-zero/Mdaz sets subnormal numbers to positive 0.
	Uninitialized array	init=zero -init=arrays	finit-zero	Random initialization by default	Uninitialized scenarios are initialized by the compiler, for which the Fortran language standard lacks specifications. Therefore, there is a lack of consistency in the initialization scope and mode. The BiSheng compiler is well developed for these scenarios. Currently, the -finit-zero option can be used to implement secure initialization.
	Sum of absolute values	None	mllvm -sad-pattern-recognition=false	mllvm -sad-pattern-recognition=true	Optimize the absolute sum calculation of the difference (sum += abs(a[i] - b[i])) to generate a simplified and efficient calculation sequence. The value true (by default) indicates that the option is enabled.
	Vectorization calculation	None	mllvm -aarch64-hadd-generation=false	mllvm -aarch64-hadd-generation=true	Use an Arm NEON instruction URHADD to complete the vectorization calculation ((x[i] + y[i] + 1) >> 1), to generate better code. The value true (by default) indicates that the optimization is enabled.
	Reduction sequence	None	mllvm -instcombine-reorder-sum-of-reduce-add=false	mllvm -instcombine-reorder-sum-of-reduce-add=true	Change the sequence of reduction operations to generate better reduction code. The value true (by default) indicates that the optimization is enabled.
	Overflow processing of undefined shift behavior	None	foverflow-shift-alt-behavior	fno-overflow-shift-alt-behavior is disabled by default.	For an undefined shift that exceeds the bit width of the integer data type, such as (int) a << 40, the BiSheng compiler folds the expression to an integer constant in advance so that it is not recognized and optimized to different values by different optimizations. This option is disabled by default.
	Unified control	None	ffp-compatibility=17/18/21	ffp-compatibility=17/18/21	A common option, which is used to control all enabled options for ensuring calculation result consistency between the current platform and a non-Arm platform.
	Default precision of floating point	None	fdefault-double-8	Single precision by default	The floating point is single-precision by default. Enable this option to change the floating point to double-precision.
	Floating-point conversion	None	freal-10-real-16 freal-10-real-4 freal-10-real-8	Not required for conversion by default	Convert a 10-byte floating-point number to 16, 4, or 8 bytes.
	Register optimization	None	modd-spreg	mno-odd-spreg by default	Enable or disable odd-numbered single-precision floating-point registers.
	Signed zero optimization	None	fsigned-zeros	fno-signed-zeros by default	Allow optimizations to ignore the sign of zero.
	NAN optimization	None	fhonor-nans	fno-honor-nans by default	Assume that all NaNs have no impact. The Clang version also ignores the NaN that has no impact.
	INF optimization	None	fhonor-infinities	fno-honor-infinities by default	Assume that there is no infinite value.
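
The hexadecimal values quoted in the "Immediate data precision bug fixing" row can be decoded to see how far apart they are. The following Python sketch does so with the standard `struct` module. The helper names are illustrative assumptions, and the code only inspects the stored bit patterns; it makes no claim about how any of the compilers parses the literal internally.

```python
import struct


def double_from_hex(hex_bits: str) -> float:
    """Decode a 16-digit hexadecimal string as an IEEE 754 double-precision value."""
    return struct.unpack(">d", bytes.fromhex(hex_bits))[0]


def to_float32(x: float) -> float:
    """Round a double to the nearest single-precision value (ties to even)."""
    return struct.unpack("f", struct.pack("f", x))[0]


# Bit patterns quoted in the table for the literal 0.002362E12_8.
icc = double_from_hex("41E1992860000000")
bisheng_gcc = double_from_hex("41E1992840000000")

# The same decimal value parsed by Python as a double.
literal = 0.002362e12

print(icc, bisheng_gcc, literal)
print("difference between the stored values:", icc - bisheng_gcc)
print("literal rounded to single precision:", to_float32(literal))
```

Decoding shows that the decimal value lies exactly halfway between the two stored values, which are 256 apart, that is, one unit in the last place of single precision at this magnitude. Rounding the literal to single precision with round-to-nearest-even yields the pattern reported for BiSheng and GCC.
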
