
Causes of Calculation Result Differences

Processor Architecture

The architecture of the Kunpeng processor differs from that of the x86 processor. For example, the x86 processor provides 40-bit and 80-bit floating-point compute units, which the Kunpeng processor does not have. When these units are involved, the two platforms produce different calculation results. Currently, such differences can be resolved on the Kunpeng platform through software emulation. For example, the Intel C++ Compiler (ICC) math library uses the powr8i4 function, which executes 80-bit x87 instructions; that is, each floating-point number occupies 80 bits. On the Kunpeng platform, you can use the open-source GNU MPFR (Multiple Precision Floating-Point Reliable) library, which supports calculation at arbitrary precision, to implement the pow calculation on 80-bit floating-point numbers.

  • Flush-to-zero (FTZ) and denormals-are-zero (DAZ):

    FTZ and DAZ are modes for handling denormalized numbers. For the definition of denormalized numbers, see Introduction to Floating-Point Numbers. With FTZ, a floating-point result that would be a denormalized number is flushed to 0. With DAZ, a denormalized input operand is treated as 0. The flush is performed in hardware, which accelerates calculation on denormalized numbers. The following code is an example:

    float a = 1.0e-5 * 1.0e-38;
    cout << "a = " << a << endl;

    When FTZ is disabled, the output is as follows:

    a = 9.94922e-44

    When FTZ is enabled by adding the -ffast-math compilation option in GCC, the output is as follows:

    a = 0

    Inspect the assembly code generated on the Kunpeng platform after the -ffast-math compilation option is added. The following routine is added:

    0000000000400770 <set_fast_math>:
      400770:       52a02000        mov     w0, #0x1000000                  // #16777216
      400774:       d51b4400        msr     fpcr, x0
      400778:       d65f03c0        ret
      40077c:       00000000        .inst   0x00000000 ; undefined

    FTZ is enabled or disabled by writing a value to the floating-point control register (FPCR). Both Kunpeng and x86 support FTZ, but because their hardware architectures differ, they apply FTZ and rounding to denormalized numbers in different orders. The code is as follows:

    float a = 1.09628081709e-33;
    float b = 1.07225660031e-05;
    float c = a * b;
    cout << setprecision(12) << "c = " << c << endl;

    Run the g++ -ffast-math command to enable FTZ on a Kunpeng device. The output is as follows:

    c = 0

    Run the g++ -ffast-math command to enable FTZ on an x86 device. The output is as follows:

    c = 1.17549435082e-38

    1.17549435082e-38 (0x800000) is the smallest positive normalized single-precision floating-point number in the IEEE 754 standard. Mathematically, 1.09628081709e-33 x 1.07225660031e-05 = 1.17549434192e-38, which is slightly smaller than this smallest normal number and is rounded up to the nearest floating-point number 1.17549435082e-38 (0x800000) on x86. The following compares the operations performed on Kunpeng and x86.

    • Kunpeng execution sequence
      1. Perform the multiplication operation: a * b = 1.17549434192E-38.
      2. Perform FTZ. 1.17549434192e-38 is a denormalized number and is converted to 0.
      3. Perform rounding. 0 is rounded to its nearest floating-point number, that is, 0.
      4. Assign 0 to variable c.
    • x86 execution sequence
      1. Perform the multiplication operation: a * b = 1.17549434192E-38.
      2. Round 1.17549434192e-38 to the nearest floating-point number 1.17549435082e-38(0x800000).
      3. Perform FTZ. 1.17549435082e-38(0x800000) is not a denormalized number and will not be converted to 0.
      4. Assign 1.17549435082e-38(0x800000) to variable c.
  • Floating-point number conversion overflow:

    When a floating-point number is converted to an integer, the integer part is retained and the fractional part is discarded. When the floating-point number is so large that its integer part cannot be represented by the target integer type, Kunpeng and x86 behave differently. Kunpeng saturates on overflow or underflow, that is, it returns the maximum or minimum value representable by the target integer type. x86 instead returns an indefinite integer value: 0x8000 0000 0000 0000 for conversions to a 64-bit integer and 0x8000 0000 for conversions to a 32-bit integer. The following code is an example:

    double a = 1.0e20;
    long b = static_cast<long>(a);
    std::cout << "b = "  << b << std::endl;

    The output on the Kunpeng platform is as follows:

    b = 9223372036854775807

    The output on the x86 platform is as follows:

    b = -9223372036854775808

    In the preceding example, 1.0e20 exceeds the maximum integer that can be represented by long. The Kunpeng platform assigns the maximum value representable by long to b, that is, 9223372036854775807 (0x7FFF FFFF FFFF FFFF). The x86 platform assigns the indefinite integer value to b, that is, -9223372036854775808 (0x8000 0000 0000 0000).

    For details about the comparison between the Kunpeng and x86 results when floating-point numbers overflow during conversion, see Data Overflow During the Conversion from the Double-Precision Floating-Point Type to the Integer Type in the Kunpeng Server Code Porting Reference.

Compiler

  • Compilation option -fp-model:

    The ICC compilation option -fp-model controls whether floating-point operations are value-safe, that is, whether the compiler may apply transformations that are equivalent in exact mathematics but can change the computed result. If floating-point operations are not value-safe, calculation results may differ. The default value of -fp-model is fast=1, which takes effect when the optimization level is -O2 or higher. Specify -fp-model=precise to keep floating-point operations value-safe.

    Floating-point numbers cannot represent all real numbers exactly. The default rounding mode on both Kunpeng and x86 rounds a real result to the nearest representable floating-point number. Rounding introduces precision loss, which can make computed results differ from the exact mathematical results. For example, (A + B) + C may not be equal to A + (B + C). See the following code:

    float a = -0.1;
    float b = 0.1;
    float c = -0.00000000149012;
    float sum1 = a + b + c;
    float sum2 = a + (b + c);
    std::cout << "sum1 = " << sum1 << std::endl;
    std::cout << "sum2 = " << sum2 << std::endl;

    Output:

    sum1 = -1.49012e-09
    sum2 = 0

    Mathematically, sum1 should equal sum2. The difference arises when (b + c) is calculated: because |c| (1.49012e-9) is about one fifth of the unit in the last place (ULP) of b (0.1), and therefore less than half a ULP, b + c rounds back to b. As a result, (b + c) = b, and sum2 = a + b = -0.1 + 0.1 = 0. For the definition of ULP, see Introduction to Floating-Point Numbers.

    Compilers optimize for performance and may implicitly change the calculation order. Because different compilers apply different optimization policies, the same code can produce different results when built with different compilers. The following code is an example:

    #include <iostream>
    float a[4 * 4] = {0};
    float sum = 0;
    int main()
    {
        a[0] = 0.1;
        a[1] = -0.1;
        a[2] = 0.00000000149012;
        for (int i = 0; i < 4 * 4; i++) {
            sum += a[i];
        }
        std::cout << "sum = " << sum << std::endl;
    }

    When ICC -O3 is used, the output is as follows:

    sum = 0

    When G++ -O3 is used, the output is as follows:

    sum = 1.49012e-09

    The ICC result is the same as that of a + (b + c) in the preceding example. At optimization level -O3, ICC enables -fp-model=fast by default. When the loop over a[i] is vectorized, transformations that are valid in exact mathematics are applied, so the additions are not performed in the order a[0] + a[1] + a[2]... GCC does not apply this vectorized reordering, so the original calculation order is retained. The two results are therefore inconsistent.

    When ICC -O3 -fp-model=precise is used, the output is as follows:

    sum = 1.49012e-09

    When -fp-model=precise is specified, ICC disables transformations that are equivalent in exact mathematics but not value-safe, including but not limited to the following:

    a + b + c → a + (b + c), x * 0 → 0, x + 0 → x, A/B → A * (1/B)

  • Immediate:

    An immediate is a constant written directly in code. For an immediate whose type is not specified, the value loaded into memory is not defined by IEEE 754. As a result, different compilers may load the immediate in different ways, and its value in memory may differ. The following code is an example:

    program test
    real (8) :: x
    x = 0.002362E12
    write(*, '(Z16)') x
    end

    When GCC is used, the output is as follows:

    41E1992840000000

    When ICC is used, the output is as follows:

    41E1992860000000

    If the immediate is explicitly typed as a double-precision floating-point number, no difference occurs.

    program test
    real (8) :: x
    x = 0.002362E12_8 ! The specified type is double-precision.
    write(*, '(Z16)') x
    end

    When GCC is used, the output is as follows:

    41E1992850000000

    When ICC is used, the output is as follows:

    41E1992850000000

Math Library

If a function's calculation error is less than 0.5 ULP in all cases, the function returns the floating-point number nearest to the exact result. However, there is no unified standard that constrains the output of functions such as log, sin, and pow across math libraries. As a result, the rounding behavior and precision of these functions may vary from one math library to another. The following code is an example:

program test
real::x, y, z
x = 0.699129045
y = 1.4
z = x ** y
print*, z
end

When the system math library libm.so.6 is used, the output is as follows:

0.605871201

When the Kunpeng Math Library (KML) libkm.so is used, the output is as follows:

0.605871141

Application Code

  • Differences caused by WRF grid division. See the following code:
    ht_loc(its-1, j) = max(ht_loc(its-1, j), ht_shad(its-1, j))
    ht_loc(ite-1, j) = max(ht_loc(ite-1, j), ht_shad(ite-1, j))
    ht_loc(i, jts-1) = max(ht_loc(i, jts-1), ht_shad(i, jts-1))
    ht_loc(i, jte-1) = max(ht_loc(i, jte-1), ht_shad(i, jte-1))

    The variables its, ite, jts, and jte in the code describe the grid area assigned to each process, which depends on the number of processes in use: the more processes, the smaller the grid area per process. When the test scale (the numbers of test nodes and threads) changes, the values of its, ite, jts, jte, and ht_loc change. After multiple iterations, the output of the entire program diverges.

  • Differences caused by undefined behavior in code, such as out-of-bounds array bugs. See the following code:
    real :: SLA_TABLE(27)
    ! i is an external input.
    masslai = max(SLA_TABLE(i), 1.0)

    The SLA_TABLE array has 27 elements. When i is greater than 27, the access is out of bounds, and the value read from SLA_TABLE(i), and therefore the value of masslai, is indeterminate. As a result, the final output is nondeterministic.