Introduction to Floating-Point Numbers

Definition

According to the IEEE 754 standard, a floating-point number consists of three parts: sign, exponent, and mantissa. A 32-bit single-precision floating-point number has a 1-bit sign, an 8-bit exponent, and a 23-bit mantissa, and can represent values from approximately -3.4e38 to +3.4e38. A 64-bit double-precision floating-point number has a 1-bit sign, an 11-bit exponent, and a 52-bit mantissa, and can represent values from approximately -1.79e308 to +1.79e308.

Figure 1 Composition of a floating-point number
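
The field layout can be inspected directly. The following is a minimal sketch in Python that uses the standard struct module to reinterpret a value as its raw bit pattern and then masks out the three fields. The helper names decompose_float32 and decompose_float64 are illustrative assumptions, not part of any standard API.

```python
import struct

def decompose_float32(x):
    """Return (sign, exponent, mantissa) of x stored as IEEE 754 single precision."""
    # Pack as a big-endian 32-bit float, then reread the same bytes as an unsigned int.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF       # 23 bits
    return sign, exponent, mantissa

def decompose_float64(x):
    """Return (sign, exponent, mantissa) of x stored as IEEE 754 double precision."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63                    # 1 bit
    exponent = (bits >> 52) & 0x7FF      # 11 bits, stored with a bias of 1023
    mantissa = bits & 0xFFFFFFFFFFFFF    # 52 bits
    return sign, exponent, mantissa

print(decompose_float32(1.0))       # (0, 127, 0)
print(decompose_float32(-0.15625))  # (1, 124, 2097152)
print(decompose_float64(1.0))       # (0, 1023, 0)
```

The exponent field is stored with a bias, so the stored value 127 in single precision corresponds to an actual exponent of 0.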

Denormalized Number

A denormalized number is a special value in floating-point operations. For a normalized floating-point number, the leading digit of the mantissa is fixed to 1 (in the binary encoding, this leading 1 is implicit and is not stored). As a decimal analogy, the normalized form of 0.001234 is 1.234e-3. However, for a very small floating-point number (for example, 1.234e-40), the exponent falls below the minimum exponent of a single-precision floating-point number. In this case, the exponent is held at its minimum value and leading zeros are added to the mantissa (for example, 0.01234e-38). Such a floating-point number whose mantissa starts with 0 is a denormalized number. The minimum positive normalized number that can be represented by a single-precision floating-point number is 1.17549435082e-38 (0x00800000). Therefore, the denormalized numbers of the single-precision format are the non-zero values strictly between -1.17549435082e-38 and 1.17549435082e-38. The smallest positive denormalized number is 2^-149, approximately 1.4e-45 (0x00000001).
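
As a rough illustration, a single-precision value is denormalized when its exponent field is all zeros and its mantissa field is non-zero. The sketch below checks this in Python with the standard struct module. The helper names float32_bits, bits_to_float32, and is_denormal32 are assumptions made for this example.

```python
import struct

def float32_bits(x):
    """Bit pattern of x after conversion to IEEE 754 single precision."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def bits_to_float32(bits):
    """Value of the single-precision number with the given 32-bit pattern."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def is_denormal32(x):
    """True if x, stored as single precision, is a denormalized number."""
    bits = float32_bits(x)
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    return exponent == 0 and mantissa != 0

min_normal = bits_to_float32(0x00800000)    # 2**-126, about 1.18e-38
min_denormal = bits_to_float32(0x00000001)  # 2**-149, about 1.4e-45

print(is_denormal32(1.234e-3))    # False: normalized
print(is_denormal32(1.234e-40))   # True: below the smallest normalized value
print(is_denormal32(min_normal))  # False: smallest normalized value
print(is_denormal32(0.0))         # False: zero is not a denormalized number
```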

ULP

In computer science and numerical analysis, the unit in the last place (ULP) is the smallest unit of precision of a floating-point number. It is the distance between two adjacent floating-point numbers, that is, the value represented by the least significant bit of the mantissa at the exponent of the number in question. The ULP therefore grows with the magnitude of the number.

For example, the real value 0.1 cannot be represented exactly in single precision. Its two nearest single-precision neighbors are 0x3dcccccc (0.099999994039536) and 0x3dcccccd (0.10000000149012). Therefore, the ULP at the real value 0.1 is |0.10000000149012 - 0.099999994039536| ≈ 0.00000000745058, that is, 2^-27.

If the computer rounds down, 0.099999994039536 is used to represent 0.1, and the error is 0.1 - 0.099999994039536 = 0.000000005960464 ≈ 0.8 ULP.

If the computer rounds up, 0.10000000149012 is used to represent 0.1, and the error is 0.10000000149012 - 0.1 = 0.00000000149012 ≈ 0.2 ULP.
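
The arithmetic above can be reproduced exactly. The following sketch, written in Python with the standard struct and fractions modules, rebuilds the two neighbors of 0.1 from their bit patterns and measures both errors in ULPs using exact rational arithmetic. The helper name bits_to_float32 and the variable names are assumptions made for this example.

```python
import struct
from fractions import Fraction

def bits_to_float32(bits):
    """Value of the single-precision number with the given 32-bit pattern."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

# The two single-precision neighbors of the real value 0.1, as exact rationals.
below = Fraction(bits_to_float32(0x3DCCCCCC))
above = Fraction(bits_to_float32(0x3DCCCCCD))

ulp = above - below          # distance between adjacent values near 0.1
exact = Fraction(1, 10)      # the real value 0.1, with no rounding

err_down = (exact - below) / ulp   # error in ULPs when rounding down
err_up = (above - exact) / ulp     # error in ULPs when rounding up

print(ulp == Fraction(1, 2**27))   # True
print(err_down, err_up)            # 4/5 1/5
```

Because the upward neighbor is closer, converting 0.1 to single precision with round toward nearest yields 0x3dcccccd.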

Rounding

The IEEE 754 standard defines four rounding modes:

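Round toward nearest: The result is rounded to the nearest representable number. If the exact result lies exactly halfway between two representable numbers, the one whose least significant bit is 0 (the even one) is chosen.
Round toward +∞: The result is rounded toward positive infinity, that is, to the smallest representable number that is not less than the exact result.
Round toward -∞: The result is rounded toward negative infinity, that is, to the largest representable number that is not greater than the exact result.
Round toward zero: The result is rounded toward 0, that is, the digits beyond the available precision are truncated.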
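
The four modes can be compared on a single exact value. The sketch below is a simplified software model, not how hardware implements rounding: it rounds an exact rational to a value with a 24-bit significand (the 23 stored bits plus the implicit leading 1) and ignores the exponent limits, so overflow and denormalized numbers are not handled. The function name round_to_float32 and the mode strings are assumptions made for this example.

```python
import math
from fractions import Fraction

def round_to_float32(value, mode):
    """Round an exact rational to single-precision spacing under the given mode.

    Simplified model: exponent range is not enforced.
    """
    value = Fraction(value)
    if value == 0:
        return value
    # Find e such that 2**e <= |value| < 2**(e + 1).
    e = abs(value.numerator).bit_length() - value.denominator.bit_length()
    if abs(value) < Fraction(2) ** e:
        e -= 1
    ulp = Fraction(2) ** (e - 23)   # spacing of single-precision values here
    scaled = value / ulp            # exact value measured in ULPs
    lo, hi = math.floor(scaled), math.ceil(scaled)
    if mode == "nearest":
        if scaled - lo < hi - scaled:
            chosen = lo
        elif scaled - lo > hi - scaled:
            chosen = hi
        else:
            chosen = lo if lo % 2 == 0 else hi   # tie: prefer the even one
    elif mode == "up":       # toward +infinity
        chosen = hi
    elif mode == "down":     # toward -infinity
        chosen = lo
    elif mode == "zero":     # toward zero
        chosen = math.trunc(scaled)
    else:
        raise ValueError("unknown rounding mode: " + mode)
    return chosen * ulp

for mode in ("nearest", "up", "down", "zero"):
    print(mode, float(round_to_float32(Fraction(1, 10), mode)))
```

For the real value 0.1, this model selects 0x3dcccccd under round toward nearest and round toward +∞, and 0x3dcccccc under round toward -∞ and round toward zero. For -0.1, round toward -∞ and round toward zero give different results, because truncation moves a negative value upward.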