Lock Optimization

Principles

Spin locks and compare-and-swap (CAS) operations are built on atomic instructions. When an atomic operation fails, the application does not release the CPU. Instead, it retries in a loop until the operation succeeds, so CPU cycles are wasted while it waits. As shown in the following figure, the yellow part indicates the busy-wait loop.
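The busy-wait behavior described above can be sketched in C11 as follows. This is a minimal illustration, not the implementation of any particular library: the names demo_spinlock, demo_spin_lock, demo_spin_unlock, and run_spin_demo are hypothetical, and the thread and iteration counts are arbitrary.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical minimal spin lock: 0 = free, 1 = held. */
typedef struct {
    atomic_int held;
} demo_spinlock;

static void demo_spin_lock(demo_spinlock *l) {
    int expected = 0;
    /* Retry the CAS until it succeeds. The thread keeps its CPU the whole
     * time, which is the wasted busy-wait time described in the text. */
    while (!atomic_compare_exchange_weak_explicit(
               &l->held, &expected, 1,
               memory_order_acquire, memory_order_relaxed)) {
        expected = 0; /* a failed CAS overwrites expected with the observed value */
    }
}

static void demo_spin_unlock(demo_spinlock *l) {
    atomic_store_explicit(&l->held, 0, memory_order_release);
}

enum { SPIN_THREADS = 4, SPIN_ITERS = 20000 };

static demo_spinlock demo_lock = { 0 };
static long demo_counter;

static void *spin_worker(void *arg) {
    (void)arg;
    for (int i = 0; i < SPIN_ITERS; i++) {
        demo_spin_lock(&demo_lock);
        demo_counter++; /* protected by the spin lock */
        demo_spin_unlock(&demo_lock);
    }
    return NULL;
}

/* Runs the workers and returns the final counter value. */
long run_spin_demo(void) {
    pthread_t t[SPIN_THREADS];
    demo_counter = 0;
    for (int i = 0; i < SPIN_THREADS; i++)
        pthread_create(&t[i], NULL, spin_worker, NULL);
    for (int i = 0; i < SPIN_THREADS; i++)
        pthread_join(t[i], NULL);
    return demo_counter;
}
```

Every thread that loses the CAS stays in the while loop instead of sleeping, so the more threads contend for demo_lock, the more CPU time is spent retrying.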

Modification Method

You can run the perf top command to find the functions that consume the most CPU resources. If lock acquisition and release functions account for more than 5% of CPU usage, consider optimizing the lock implementation. The optimization approaches are as follows:

Change a large lock to small locks: In scenarios with many concurrent tasks, if the system has a single global variable, every CPU core contends for the lock that protects it, which causes severe lock contention. Based on the service logic, allocate a separate resource to each CPU core or thread instead.
Use the ldaxr and stlxr instructions to implement atomic operations: These instructions carry acquire and release semantics, so they enforce memory ordering as part of the atomic operation. The ldxr and stxr instructions provide no ordering guarantee, so a memory barrier instruction (dmb ish) must be added to ensure memory consistency. According to the test result, ldaxr + stlxr performs better than ldxr + stxr + dmb ish.
Reduce the number of concurrent threads. For details, see Adjusting Number of Concurrent Threads.
Align lock variables to cache lines: A frequently accessed lock variable that shares a cache line with other data can cause false sharing, because every write to the lock invalidates that cache line on the other cores. For details, see Optimizing Cacheline.
Optimize the implementation of atomic operations in the code. The following figure shows a software code implementation.
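The original figure is not reproduced here. As a stand-in, the following C11 sketch shows the kind of read, add, and write retry loop that this item refers to. It is an assumption about the general pattern rather than the code from the figure, and add_return_cas_loop is a hypothetical name.

```c
#include <stdatomic.h>

/* Hypothetical helper: adds delta to *v and returns the new value,
 * built from an atomic read plus a compare-and-swap retry loop. */
static int add_return_cas_loop(atomic_int *v, int delta) {
    int old = atomic_load_explicit(v, memory_order_relaxed); /* atomic read */
    int desired;
    do {
        desired = old + delta; /* variable addition */
        /* Atomic write: succeeds only if *v still equals old. On failure,
         * old is refreshed with the current value and the loop retries. */
    } while (!atomic_compare_exchange_weak(v, &old, desired));
    return desired;
}
```

Under contention, each failed compare-and-swap repeats the whole sequence, so the loop body may run many times for a single logical addition.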

In terms of function call logic, the while loop repeatedly performs an atomic read, an addition, and an atomic write, which is redundant. Optimization method: Replace this sequence with the atomic_add_return function to reduce the number of instructions and improve performance. The following figure shows the code after replacement.
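The replacement figure is also not reproduced here. The text names atomic_add_return. In portable C11, a comparable effect can be sketched with atomic_fetch_add, which returns the previous value, so the new value is the result plus the addend. The wrapper name add_return_fetch_add is hypothetical, and treating it as equivalent to atomic_add_return is an assumption.

```c
#include <stdatomic.h>

/* Hypothetical wrapper: adds delta to *v in one atomic read-modify-write
 * and returns the new value. atomic_fetch_add returns the old value,
 * so delta is added once more to obtain the result. */
static int add_return_fetch_add(atomic_int *v, int delta) {
    return atomic_fetch_add(v, delta) + delta;
}
```

The source code no longer contains an explicit retry loop. How the compiler implements the call depends on the target: on AArch64 it may become a single atomic add instruction when the Large System Extensions are enabled, or an exclusive load/store loop otherwise.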

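Returning to the first and fourth approaches in the list above, the following sketch gives each thread its own counter slot instead of one lock-protected global, and aligns each slot to a cache line so that slots do not share a line. All names are hypothetical, and the CACHE_LINE value of 128 bytes is an assumption that must be checked against the target processor.

```c
#include <pthread.h>
#include <stdalign.h>

#define CACHE_LINE 128 /* assumed size; verify for the target CPU */

/* One slot per thread. alignas pads the struct to a full cache line,
 * so neighboring slots never share a line (no false sharing). */
typedef struct {
    alignas(CACHE_LINE) long value;
} padded_counter;

enum { SHARD_THREADS = 4, SHARD_ITERS = 100000 };

static padded_counter shards[SHARD_THREADS];

static void *shard_worker(void *arg) {
    padded_counter *mine = arg;
    for (int i = 0; i < SHARD_ITERS; i++)
        mine->value++; /* no lock needed: only this thread writes this slot */
    return NULL;
}

/* Runs the workers and returns the sum of all per-thread slots. */
long run_shard_demo(void) {
    pthread_t t[SHARD_THREADS];
    long total = 0;
    for (int i = 0; i < SHARD_THREADS; i++) {
        shards[i].value = 0;
        pthread_create(&t[i], NULL, shard_worker, &shards[i]);
    }
    for (int i = 0; i < SHARD_THREADS; i++) {
        pthread_join(t[i], NULL); /* join makes the slot safe to read */
        total += shards[i].value;
    }
    return total;
}
```

The same alignas technique applies to a lock variable itself when a lock cannot be removed: placing the lock alone on its cache line keeps writes to neighboring data from invalidating it.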