Case: Application System Suspension
Symptom
The server stops responding while an application is running on it.
Fault Locating
- Configure kdump and kernel parameters.
- Reproduce the problem and confirm that a vmcore file is generated. Debug the vmcore file with the crash utility. As shown in the following figure, the failure is caused by a null pointer.

- Check the call stack of the error. The fault probably lies in the inet_sock_destruct function.

- Disassemble the inet_sock_destruct function.

- View the source code of the inet_sock_destruct function.



Analysis of the assembly and source code shows that the x2 register holds a null pointer address, which means that the __skb_dequeue is null. However, the __skb_dequeue must be non-null when it is passed to the kfree_skb function. This indicates that other threads are also operating on the __skb_dequeue: after the current thread determines that the __skb_dequeue is not empty, another thread empties it, so the subsequent operations of the current thread fail.
- Find out which other threads operate on the __skb_dequeue. To do so, add debugging code to the kernel that saves the stack information of each __skb_dequeue operation. The following figure shows the code modified to save the stack information whenever an operation is performed on the __skb_dequeue.

- Recompile the kernel, run the program, and generate a new vmcore file. Debug it with the crash utility and locate the fault using the preceding method.

- Analyze the call logic of the stack:
- Run the program again to generate another vmcore file. Debug it with the crash utility and locate the fault using the preceding method.

- Compare the two vmcore files generated in the previous steps. They show that two threads operate on the __skb_dequeue at the same time. The following figure shows the call relationship.

When different cores execute different code paths, their __skb_dequeue operations may not be synchronized. As a result, the queue is released more than once and the system stops responding.
- To keep the operations consistent and avoid the repeated release, add a memory barrier before the sk_free operation. Recompile the kernel and run the program again. The problem no longer occurs.
void __sock_wfree(struct sk_buff *skb)
{
    struct sock *sk = skb->sk;

    if (refcount_sub_and_test(skb->truesize, &sk->sk_wmem_alloc)) {
        smp_rmb(); /* Add a memory barrier for verification. */
        __sk_free(sk);
    }
}
- Check the upstream kernel. The kernel community added the memory barrier to the refcount_sub_and_test function in version 5.1, so upgrading the kernel to that version or later also fixes this problem.
