Difference Between Architectures
The following uses a spin lock implemented for the x86 platform as an example to describe how to port it to the ARM architecture:
#define barrier() __asm__ __volatile__("": : :"memory")

int CompareAndSwap(volatile int* ptr,
                   int old_value,
                   int new_value) {
    int prev;
    __asm__ __volatile__("lock; cmpxchgl %1,%2"
                         : "=a" (prev)
                         : "q" (new_value), "m" (*ptr), "0" (old_value)
                         : "memory");
    return prev;
}

static void lock(int *l) {
    while (CompareAndSwap(l, 0, 1) != 0);
}

static void unlock(int volatile *l) {
    barrier();
    *l = 0;
}
Instruction Set Difference
This is a simplified spin lock implementation. When porting the CompareAndSwap function, pay attention to the differences between the instruction sets of the two architectures. This implementation uses the x86 cmpxchgl instruction through inline assembly; the ARM architecture, however, has no directly corresponding instruction.
In the ARM architecture, atomic operations are implemented by using the exclusive instruction pair, as shown in the following figure (from Arm Architecture Reference Manual for A-profile architecture).

Therefore, the exclusive instruction pair is used to implement the CompareAndSwap function as follows:
int CompareAndSwap(volatile int* ptr,
                   int old_value,
                   int new_value) {
    int prev;
    int temp;
    __asm__ __volatile__ (
        "0:                                     \n\t"
        "ldxr %w[prev], %[ptr]                  \n\t"  /* exclusive load */
        "cmp  %w[prev], %w[old_value]           \n\t"
        "bne  1f                                \n\t"
        "stxr %w[temp], %w[new_value], %[ptr]   \n\t"  /* exclusive store */
        "cbnz %w[temp], 0b                      \n\t"  /* retry if the store failed */
        "1:                                     \n\t"
        : [prev]"=&r" (prev),
          [temp]"=&r" (temp),
          [ptr]"+Q" (*ptr)
        : [old_value]"IJr" (old_value),
          [new_value]"r" (new_value)
        : "cc", "memory"
    );
    return prev;
}
Memory Sequence Difference
After the CompareAndSwap function is replaced, the spin lock still does not work as expected, due to the memory ordering differences between x86 and ARM.
Code before the modification:
static void lock(int *l) {
    while (CompareAndSwap(l, 0, 1) != 0);
}

static void unlock(int volatile *l) {
    barrier();
    *l = 0;
}
As described in Table 1, the ARM architecture allows atomic operations to be reordered with ordinary memory reads and writes. As a result, memory accesses that follow the lock function in the above code may be performed before the atomic operation inside lock has actually acquired the lock.
In addition, the original code uses only a compiler memory barrier (barrier()) when the lock is released. Under the ARM memory model, a compiler barrier alone is not sufficient to ensure correctness, so it must be replaced with a CPU-level memory barrier.
Modify the code as follows:
#define smp_mb() asm volatile("dmb ish" ::: "memory")

static void lock(int *l) {
    while (CompareAndSwap(l, 0, 1) != 0);
    smp_mb();
}

static void unlock(int volatile *l) {
    smp_mb();
    *l = 0;
}
The porting of the spin lock is now complete, and the implementation has been verified to work as expected. In fact, according to Using acquire and release Semantics for Synchronization, the lock can be further optimized with half barriers (one-way barriers) to improve performance: use only acquire semantics when acquiring the lock, and only release semantics when releasing it. This allows the preceding CPU-level full memory barriers to be removed. The final code for the ARM architecture is as follows:
int CompareAndSwap(volatile int* ptr,
                   int old_value,
                   int new_value) {
    int prev;
    int temp;
    __asm__ __volatile__ (
        "0:                                     \n\t"
        "ldaxr %w[prev], %[ptr]                 \n\t"  /* load-acquire exclusive */
        "cmp   %w[prev], %w[old_value]          \n\t"
        "bne   1f                               \n\t"
        "stxr  %w[temp], %w[new_value], %[ptr]  \n\t"  /* exclusive store */
        "cbnz  %w[temp], 0b                     \n\t"  /* retry if the store failed */
        "1:                                     \n\t"
        : [prev]"=&r" (prev),
          [temp]"=&r" (temp),
          [ptr]"+Q" (*ptr)
        : [old_value]"IJr" (old_value),
          [new_value]"r" (new_value)
        : "cc", "memory"
    );
    return prev;
}
static void lock(int *l) {
    while (CompareAndSwap(l, 0, 1) != 0);
}

static void unlock(int volatile *l) {
    int zero = 0;
    __atomic_store(l, &zero, __ATOMIC_RELEASE);
}