NEON Instruction Acceleration

NEON is a SIMD-based technology that can operate multiple pieces of data at the same time based on a single instruction. The NEON instruction is similar to the MMX/SSE/AVX instruction of the Intel CPU. NEON optimizes application performance through vectorized computing and is generally used in image processing, audio and video processing, and parallel data processing, which require a large number of computing resources.

The NEON technology depends on the hardware support of 128-bit NEON registers. An NEON register is a vector register that can store multiple data elements of the same data type.

The following is a register in the AArch64 architecture of Armv8-A:

64-bit register (general register)
Armv8 has 31 64-bit general registers and one special register. Therefore, it can be considered as 31 64-bit X registers or 31 32-bit W registers (lower 32 bits of the X register).

128-bit register (vector register)
Armv8 has 32 128-bit V registers, which can be considered as 32 32-bit S registers or 32 64-bit D registers.

Using the Compiler Capability for Automatic Vectorization Acceleration

Principles

The compiler supports automatic vectorization. It automatically uses the NEON attribute to vectorize code during compilation. Before enabling automatic vectorization, you need to enable the corresponding compilation options. Not all code can be vectorized. The code must comply with certain encoding methods and rules to provide more prompt information for the compiler and further trigger the compiler to perform vectorization of code.

The following compilers support this feature: GCC, LLVM, and Arm compilers for embedded and Linux projects.

Modification Method

Enable automatic vectorization compilation options.

When the GCC compiler uses -O3, the -ftree-vectorize option is automatically enabled. You need to add the -ftree-vectorize option under -O1 and -O2 to perform vectorization. In -O0 mode, vectorization cannot be performed even if -ftree-vectorize is added.
The armcc compiler uses the -vectorize option to enable vectorized compilation. Generally, a higher optimization level, such as -O2 or -O3, is selected to enable the -vectorize option. In -O1 mode, the -vectorize option must be used to enable vectorized compilation. In -O0 mode, the compiler cannot perform vectorized compilation even if the -vectorize option is added.

Vectorization of double-precision floating-point computing is supported only in the AArch64 architecture of Armv8-A. Do not use the double-precision floating-point data type in other architectures because this type prevents the compiler from performing vectorization. The data types supported by each architecture are as follows:

-	Armv7-A/R	Armv8-A/R	Armv8-A
-	-	aarch32	aarch64
Floating-point	32-bit	16-bit/32-bit	16-bit/32-bit/64-bit
Integer	8-bit/16-bit/32-bit	8-bit/16-bit/32-bit/64-bit	8-bit/16-bit/32-bit/64-bit

Run the arch command to check whether the CPU hardware architecture is AArch64 or AArch32.

Trigger code vectorization in coding.
1. When the number of loops is known, a constant is directly transferred instead of a variable. In this way, the compiler can specify the number of loops in advance. When the number of loops is an exponential multiple of 2, the compiler needs to be informed so that vectorization can be performed as much as possible. When the number of loops is not an exponential multiple of 2, the loops can be decomposed.
```
void vecAdd(int *vecA, int *vecB, int *vecC, int len) 
{
    int i;
    // Inform the compiler that the value of len is an integer multiple of 4.
    for (i = 0; i < len * 4; i++) {  
        vecC[i] = vecA[i] + vecB[i];
    }
 }
```
2. Use < instead of <= or != to determine the condition for ending a control loop. < enables the compiler to identify that the loop ends before the variable value. This helps the compiler to perform vectorization.
3. Use the keyword restrict.
  Add the __restrict or __restrict__ keyword to the pointer to inform the compiler that the object has been referenced by the pointer and the content of the object cannot be modified directly or indirectly. In this way, the compiler learns that the current object does not have other dependencies and can perform parallel operations and vectorization. Before using these keywords, ensure that the pointer access areas do not overlap. Otherwise, the calculation result may be incorrect.
```
void vecAdd(int *__restrict__ vecA, int *__restrict__ vecB, int *__restrict__ vecC, int len)
{
    int i;
    for (i = 0; i < len *4; i++) {
        vecC[i] = vecA[i] + vecB[i];
    }
}
```
4. Avoid cyclic dependency (the result of a loop is affected by the result of the previous loop).
5. On the condition that the requirements are met, use the data type as small as possible so that the NEON register can process more data at a time after vectorization, improving the code performance after vectorization.
6. Avoid condition judgment in the loop and avoid using break to jump out of the loop.
7. Write simple code to make it easier for the compiler to understand and automate vectorization. (The degree of vectorization depends on the degree to which the compiler understands the coding intent of the coder.)
8. Use array subscripts instead of pointers to access elements.
9. When constructing a structure, ensure that the data types of variables in the structure are the same to facilitate quantization during data loading.
  The following is the 4-byte-aligned pixel data structure, which can be vectorized in the following way:
```
struct aligned_pixel {
    char r;
    char g;
    char b;
    char not_used;  /* Padding used to keep r aligned to a 32-bit word */
}screen[10];
```
  If only the type of a single element variable in the structure is changed for data alignment, the variable data types in the structure are different. As a result, automatic vectorization cannot be performed.
```
struct pixel {
    char r;
    short g; /* Green channel contains more information */
    char b;
}screen[10];
```

Using NEON Intrinsic to Improve Performance

Principles

The NEON intrinsic function is a series of C function calls that the compiler can replace with appropriate NEON instructions or NEON instruction sequences. The NEON intrinsic function provides almost the same function as NEON assembly instructions, but the NEON intrinsic function leaves register allocation to the compiler so that developers can focus on algorithm development. Compared with the NEON assembly instruction coding, the NEON intrinsic coding has better maintainability. The Arm, GCC, and LLVM compilers support NEON intrinsic.

Modification Method

To use NEON intrinsic functions, add the header file #include <arm_neon.h>. For details about the NEON intrinsic function list and usage, see NEON intrinsic reference: https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics.

Parent topic: Optimization of Hot Functions