
Kunpeng Pipelines

Execution of the three-level pipeline is easily interrupted, resulting in low instruction execution efficiency. The five-level instruction pipeline developed later is regarded as a classic processor configuration and has been widely adopted in RISC processors. It extends the three-level pipeline (instruction fetch, decode, and execute) by splitting the execution phase into execute, memory access, and write-back, which removes the memory-access delay from the execution phase of the three-level pipeline. However, it introduces new problems such as register interlocking, which can still interrupt the pipeline. The Kunpeng 920 processor uses an eight-level pipeline: instructions are fetched, decoded, register-renamed, and scheduled. Once scheduling is complete, instructions are dispatched out of order to one of eight execution pipes, each of which can accept and complete one instruction per cycle. Finally, the memory access and write-back operations are performed.

Figure 1 Kunpeng pipeline structure

The Kunpeng execution pipes support the instruction types listed in Table 1.

Table 1 Execution pipes and functions

| Execution Pipe (Mnemonic) | Function |
| --- | --- |
| ALU1 (ALU) | Integer operations |
| ALU2/3/BRU1/2 (ALU/BRU) | Integer operations, branch jumps |
| Multi-cycle (MDU) | Integer shift, multiplication, division, and CRC operations |
| LoadStore 0/1 (LS) | Memory access operations |
| FP/ASIMD 1 (FSU1) | ASIMD ALU, ASIMD misc, ASIMD integer multiply, FP convert, FP misc, FP add, FP multiply, FP divide, crypto uops, Hivector |
| FP/ASIMD 2 (FSU2) | ASIMD ALU, ASIMD misc, FP convert, FP misc, FP add, FP multiply, FP sqrt, ASIMD shift uops, Hivector |

No matter how well a pipeline is designed, interruptions are inevitable and reduce its execution efficiency. The following sections describe several common causes of pipeline interruption and their solutions.

Branch Jump

Code frequently contains statements that break sequential execution, such as if, switch, for, while, and return. The compiler translates these statements into jump instructions such as B, BL, BX, and BLX. These jump instructions stall the pipeline, as shown in Figure 2.

Figure 2 Pipeline interruption caused by branch jump

When the BL instruction is executed, the following two instructions, ADD and SUB, have already entered the pipeline but cannot be executed. After the BL instruction completes, the PC is set to the computed target address, and instructions are fetched from that address into the pipeline. The ADD and SUB instructions that had already entered the pipeline are flushed, and the fetch and decode stages that were busy with them stall.

Modern processors use branch prediction to reduce the impact of jump instructions on the pipeline. Branch prediction can be static or dynamic. Static prediction is performed at compile time. Dynamic prediction, performed during the instruction fetch phase, predicts the most likely address of the next instruction within a range. In this way, instructions are fetched in the order the code will actually execute, rather than in the order the instructions are stored in memory.

Developers can use static prediction to improve the accuracy of branch prediction. For example, the Linux kernel source code defines the likely() and unlikely() macros:

# ifndef likely
#  define likely(x) (__builtin_expect(!!(x), 1))
# endif
# ifndef unlikely
#  define unlikely(x) (__builtin_expect(!!(x), 0))
# endif
__builtin_expect is a built-in function provided by GCC. __builtin_expect((x), 1) tells the compiler that x is likely to be true, and __builtin_expect((x), 0) tells it that x is likely to be false. That is, likely() hints that the statement after if will usually be executed, and unlikely() hints that the statement after else will usually be executed. The compiler then places the more probable code path immediately after the branch instruction, increasing the instruction cache hit ratio and improving pipeline efficiency. The following code is an example:
if (likely(sem->count > 0))
        sem->count--;
else
        __down(sem);

However, many code branches depend on the actual service workload and cannot be predicted manually. The common industry solution is profile-guided optimization: during compilation, the compiler instruments the code branches; the instrumented program is then run for an extended period in a real or simulated environment, collecting statistics on how often each branch is taken. These statistics are fed back to the compiler, which rearranges the code so that the paths with the highest execution probability directly follow the branch instructions, reducing pipeline stalls and improving pipeline efficiency and performance.

Processor Interrupts

During system running, an interrupt may be generated at any time and is unrelated to the instruction currently being executed. When an interrupt occurs, the processor does not abort the instruction in flight; instead, it responds to the interrupt after that instruction completes, as shown in Figure 3 (using the three-level pipeline as an example).

Figure 3 Interrupt occurrence

If an interrupt occurs while the ADD instruction is executing, the processor jumps to offset 0x18 of the exception vector table, which starts at address 0 (a B instruction to the handler address), saves the interrupt context, and then jumps to the interrupt handler (SUB) to process the interrupt. This process involves two jumps plus an interrupt return, which costs many CPU cycles and greatly reduces pipeline efficiency.

Interrupts are unpredictable, so the compiler cannot rearrange code to optimize the pipeline for them. However, interrupts can be managed manually. For example, in some scenarios you can modify kernel parameters to coalesce interrupts and reduce their number. Alternatively, you can bind interrupts to a dedicated CPU core, so that the cores running service processes can focus on processing service data without being interrupted.