Tuning Overview

Tuning Process Flow

The performance tuning roadmap is as follows:

If CPU usage is low, the CPU is underutilized and the application is probably blocked elsewhere. Use a tool such as strace to find where it blocks. Typically the application is waiting on drive or network I/O, or its service logic is sleeping or waiting for a signal. Tuning measures for these cases are described in other sections.
If CPU usage is high, either choose more capable hardware and tune its configuration parameters for the service scenario, or optimize the software to reduce its CPU usage.
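
The first step of the roadmap, deciding whether CPU usage is low or high, can be sketched as follows. This is a minimal illustration that assumes a Linux system and reads the aggregate "cpu" line of /proc/stat. The function names and the 50% threshold are hypothetical choices for the example, not values from this guide.

```python
import time


def parse_cpu_line(line):
    """Return (idle, total) jiffies from the aggregate "cpu" line of /proc/stat."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in fields[1:]]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice
    # Time spent in iowait is counted as not busy here.
    idle = values[3] + (values[4] if len(values) > 4 else 0)
    # guest and guest_nice are already included in user and nice, so skip them.
    total = sum(values[:8])
    return idle, total


def cpu_usage_percent(before, after):
    """Compute CPU usage in percent between two (idle, total) samples."""
    idle_delta = after[0] - before[0]
    total_delta = after[1] - before[1]
    if total_delta <= 0:
        return 0.0
    return 100.0 * (1.0 - idle_delta / total_delta)


def sample_cpu_usage(interval=1.0, path="/proc/stat"):
    """Sample /proc/stat twice and return the usage over the interval."""
    with open(path) as f:
        before = parse_cpu_line(f.readline())
    time.sleep(interval)
    with open(path) as f:
        after = parse_cpu_line(f.readline())
    return cpu_usage_percent(before, after)


def tuning_branch(usage_percent, threshold=50.0):
    """Pick a roadmap branch; the threshold is an arbitrary illustration."""
    if usage_percent < threshold:
        return "low: find where the application blocks (for example, with strace)"
    return "high: tune hardware configuration or optimize software"
```

On a Linux host, `tuning_branch(sample_cpu_usage())` returns the branch to follow. A real assessment would look at per-core usage over a longer window rather than a single one-second sample.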

Configure DIMMs to match the CPU's memory capability. Populate every memory channel to maximize memory bandwidth: one Kunpeng 920 processor supports eight memory channels, so a two-processor system supports 16. Prefer high-frequency DIMMs, which further increase memory bandwidth. With one DIMM per channel (1DPC), the Kunpeng 920 supports a maximum memory frequency of 2933 MHz.
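
To see why populating every channel matters, the theoretical peak bandwidth can be estimated as channels × transfer rate × bytes per transfer. The sketch below assumes a 64-bit (8-byte) data bus per DDR4 channel and treats the DIMM speed as a transfer rate in MT/s. The function name and the 2666 MT/s example speed are illustrative assumptions, and measured bandwidth is always lower than this upper bound.

```python
def peak_memory_bandwidth_gbs(channels, transfer_rate_mts, bus_width_bits=64):
    """Theoretical peak bandwidth in GB/s (1 GB = 10**9 bytes).

    channels          -- number of populated memory channels
    transfer_rate_mts -- DIMM speed in mega-transfers per second
    bus_width_bits    -- data bus width per channel (64 bits for DDR4)
    """
    bytes_per_transfer = bus_width_bits // 8
    return channels * transfer_rate_mts * 1e6 * bytes_per_transfer / 1e9


if __name__ == "__main__":
    # Hypothetical comparison: all eight channels populated versus only four.
    for populated in (8, 4):
        bandwidth = peak_memory_bandwidth_gbs(populated, 2666)
        print(f"{populated} channels: {bandwidth:.1f} GB/s")
```

Halving the number of populated channels halves the peak, which is why full channel population is recommended before spending on faster DIMMs.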

Main Optimization Parameters

Optimization Item	Description	Default Value	When to Take Effect	Kunpeng 916	Kunpeng 920
Optimizing NUMA configurations	In a NUMA architecture, a CPU core accesses memory on its local node with lower latency than memory on a remote node. Bind applications to a NUMA node to avoid the performance loss caused by remote memory access.	No core binding by default	Immediately	Yes	Yes
Modifying the CPU prefetch configuration	When data access is concentrated (good locality), prefetching loads data into the CPU cache before it is needed and improves performance. When access is scattered, the prefetch hit ratio is low and prefetching wastes memory bandwidth.	On	After the system restarts	No	Yes
Adjusting the timer mechanism	The nohz mechanism suppresses unnecessary timer interrupts, which reduces CPU scheduling overhead.	Varies by OS. Euler: nohz=off	After the system restarts	Yes	Yes
Adjusting the memory page size to 64 KB	With a larger page size, each TLB entry maps more memory, which raises the TLB hit rate and reduces the number of memory accesses needed for address translation.	Varies by OS: 4 KB or 64 KB	After the kernel is recompiled and updated	Yes	Yes
Optimizing the number of concurrent application threads	Adjust the number of concurrent application threads to balance multi-core utilization against resource contention.	Determined by applications	Immediately or after the system restarts (determined by applications)	Yes	Yes
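
As an illustration of the NUMA item in the table, the sketch below restricts a process to the CPUs of one NUMA node. It assumes a Linux system that exposes node topology under /sys/devices/system/node, and the helper names are hypothetical. It sets CPU affinity only; memory placement then relies on the kernel's default local allocation, so strict memory binding needs a tool such as numactl instead.

```python
import os


def parse_cpulist(text):
    """Expand a kernel cpulist string such as "0-3,8-11" into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def node_cpus(node, sysfs_root="/sys/devices/system/node"):
    """Read the CPUs that belong to one NUMA node from sysfs."""
    with open(os.path.join(sysfs_root, f"node{node}", "cpulist")) as f:
        return parse_cpulist(f.read())


def bind_to_node(node, pid=0):
    """Restrict a process (0 means the calling process) to one node's CPUs."""
    cpus = node_cpus(node)
    os.sched_setaffinity(pid, cpus)  # Linux-only call
    return cpus
```

Calling `bind_to_node(0)` early in a worker process keeps its threads on node 0, and the change takes effect immediately, matching the "Immediately" entry in the table.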
