Kunpeng Processor NUMA Overview
As society becomes increasingly digitized and intelligent, more and more devices connect to the Internet, the Internet of Things (IoT), and the Internet of Vehicles (IoV), creating enormous demand for computing power. Single-core performance, however, has run into the power wall: power consumption and heat dissipation limit how much further single-core computing power can grow. To meet the rapidly growing computing demands of the intelligent world, multi-core architectures have become the main direction of evolution.
The traditional multi-core solution uses symmetric multi-processing (SMP), as shown in Figure 1. In an SMP architecture, all processors have equal status and the same access to memory. Any program, process, or thread can be scheduled on any processor, so with operating system support the load can be balanced evenly across processors, which greatly improves the performance and throughput of the whole system. However, all cores share the same bus to access memory. As the number of cores grows, this shared bus becomes a bottleneck that limits system scalability and performance.
Kunpeng processors support the non-uniform memory access (NUMA) architecture, which removes the core-count limit that the shared bus imposes on SMP systems. In a NUMA architecture, multiple cores are grouped into a node, and each node behaves like a small SMP system. Nodes within the same CPU communicate over the on-chip network, while different CPUs communicate through Hydra interfaces, which provide high-bandwidth, low-latency inter-chip links, as shown in Figure 2.
In a NUMA architecture, memory is physically distributed across the nodes, and the dual in-line memory modules (DIMMs) of all nodes together form the global memory of the system. The time a core needs to access memory depends on where that memory is located relative to the core: accessing local memory (attached to the core's own node) is faster than accessing remote memory attached to another node.
The Linux kernel has supported NUMA since version 2.5, and modern operating systems provide tools and interfaces (for example, numactl and libnuma on Linux) for configuring and optimizing local memory access. With proper performance tuning, a Kunpeng-based system can avoid the bus bottleneck of the SMP architecture, scale to more cores, and deliver better and more flexible computing power.
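To make the local/remote distinction concrete, the following minimal Python sketch reads the NUMA topology that the Linux kernel exposes under sysfs (`/sys/devices/system/node/nodeN/cpulist` and `/sys/devices/system/node/nodeN/distance`) and orders the nodes from nearest to farthest. It assumes a Linux system; the helper names (`read_numa_topology`, `node_of_cpu`, `nodes_by_distance`, `pin_to_node`) are hypothetical illustrations, not part of any Kunpeng or Huawei tool. The distances are the relative values reported by firmware (10 conventionally means local), not measured latencies.

```python
import os
import re
from pathlib import Path


def parse_cpulist(text):
    """Expand a Linux cpulist string such as "0-3,8,10-11" into a list of CPU ids."""
    cpus = []
    for part in text.strip().split(","):
        if not part:  # memory-only nodes have an empty cpulist
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def read_numa_topology(root="/sys/devices/system/node"):
    """Return {node_id: {"cpus": [...], "distance": [...]}}, or {} if sysfs lacks NUMA info."""
    base = Path(root)
    topology = {}
    if not base.is_dir():
        return topology
    for entry in base.iterdir():
        match = re.fullmatch(r"node(\d+)", entry.name)  # skip files such as "online"
        if not match:
            continue
        node = int(match.group(1))
        cpus = parse_cpulist((entry / "cpulist").read_text())
        distance = [int(d) for d in (entry / "distance").read_text().split()]
        topology[node] = {"cpus": cpus, "distance": distance}
    return dict(sorted(topology.items()))


def node_of_cpu(topology, cpu):
    """Return the NUMA node that owns the given CPU."""
    for node, info in topology.items():
        if cpu in info["cpus"]:
            return node
    raise ValueError(f"CPU {cpu} not found in any NUMA node")


def nodes_by_distance(topology, node):
    """List node ids ordered from nearest (local) to farthest, as seen from `node`."""
    # Assumption: the i-th distance entry refers to the i-th node in ascending id order,
    # which holds when the online node ids are contiguous.
    pairs = zip(sorted(topology), topology[node]["distance"])
    return [n for _, n in sorted((d, n) for n, d in pairs)]


def pin_to_node(topology, node):
    """Restrict the calling process to the CPUs of one NUMA node (Linux only)."""
    cpus = topology[node]["cpus"]
    if not cpus:
        raise ValueError(f"node {node} has no CPUs")
    os.sched_setaffinity(0, cpus)


if __name__ == "__main__":
    topo = read_numa_topology()
    if not topo:
        print("No NUMA topology found under /sys/devices/system/node")
    for node, info in topo.items():
        print(f"node {node}: cpus={info['cpus']} distance={info['distance']}")
        print(f"  nearest-first order: {nodes_by_distance(topo, node)}")
```

Pinning a process with `pin_to_node` keeps its threads on one node; under Linux's default memory policy, pages are then usually allocated on that same node, which keeps most accesses local. On a real system, `numactl --hardware` prints the same node and distance information, and `numactl --cpunodebind` and `--membind` apply a placement policy without any code changes.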

