Case: Application System Suspension

Symptom

The server hangs (stops responding) while an application is running on it.

Fault Locating

  1. Configure kdump and kernel parameters.
  2. Reproduce the problem and confirm that a vmcore file is generated. Analyze the vmcore file with the crash utility. As shown in the following figure, the kernel hit a null pointer.

  3. Check the call stack of the error. The fault probably occurs in the inet_sock_destruct function.

  4. Disassemble the inet_sock_destruct function.

  5. View the source code of the inet_sock_destruct function.

    According to the assembly and source code, the x2 register holds a null pointer, which means that the data that __skb_dequeue operates on is null. However, this data must be non-null by the time it is passed to the kfree_skb function. This indicates that another thread is operating on the same data: after the current thread checks that the data is not empty, another thread empties it, and the subsequent operations of the current thread then fail.

  6. Find out which other threads operate on the same data. To do so, add debugging code to the kernel that saves the stack information of each __skb_dequeue operation. As shown in the following figure, the code is modified to save the stack information whenever such an operation is performed.

  7. Recompile the kernel, run the program to generate a new vmcore file, and analyze the file with the crash utility in the same way as before.

  8. Analyze the call logic of the stack: tcp_done > inet_csk_destroy_sock > sk_stream_kill_queues > kfree_skb > skb_release_all
  9. Run the program again to generate another vmcore file, and analyze it with the crash utility in the same way.

  10. Compare the two vmcore files generated in the preceding steps. They show that two threads perform __skb_dequeue operations on the same queue at the same time. The following figure shows the calling relationship.

    When different cores run different code paths, their __skb_dequeue operations may not be synchronized. As a result, the queue is released more than once and the system hangs.

  11. To keep the operations consistent and avoid the repeated release, add a memory barrier before the __sk_free call in the __sock_wfree function, as in the following code. Recompile the kernel and run the program again. The problem no longer occurs.
    void __sock_wfree(struct sk_buff *skb)
    {
        struct sock *sk = skb->sk;

        if (refcount_sub_and_test(skb->truesize, &sk->sk_wmem_alloc)) {
            smp_rmb(); /* Add a memory barrier for verification. */
            __sk_free(sk);
        }
    }
    
  12. Check the upstream kernel. The kernel community added the memory barrier to the refcount_sub_and_test function in version 5.1, so upgrading to kernel 5.1 or later fixes this problem.
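
The instrumentation described in step 6 is only shown as a figure in the original case. As a rough illustration of the idea, the following user-space sketch records the call stack of the most recent operation on a queue. It is not the kernel patch: struct dbg_queue and the dbg_* functions are hypothetical names, and it uses the glibc backtrace() interface, whereas kernel code would need an in-kernel facility such as save_stack_trace().

```c
#include <assert.h>
#include <execinfo.h>   /* backtrace(), backtrace_symbols_fd(): glibc-specific */
#include <stddef.h>
#include <unistd.h>     /* STDERR_FILENO */

#define DBG_STACK_DEPTH 16

/* Hypothetical stand-in for a queue head with a debug record attached. */
struct dbg_queue {
    int   qlen;                            /* number of queued items */
    void *last_op_stack[DBG_STACK_DEPTH];  /* return addresses of the last operation */
    int   last_op_depth;                   /* number of valid entries above */
};

/* Save the current call stack; call this at the start of every queue operation. */
void dbg_record_stack(struct dbg_queue *q)
{
    assert(q != NULL);
    q->last_op_depth = backtrace(q->last_op_stack, DBG_STACK_DEPTH);
}

/* Example operation: remove one item and remember who asked for it.
 * Returns 0 on success, -1 if the queue was already empty. */
int dbg_dequeue(struct dbg_queue *q)
{
    dbg_record_stack(q);
    if (q->qlen == 0)
        return -1;
    q->qlen--;
    return 0;
}

/* Print the saved stack to stderr, the user-space analogue of reading it
 * back from the dump. */
void dbg_dump_stack(const struct dbg_queue *q)
{
    backtrace_symbols_fd(q->last_op_stack, q->last_op_depth, STDERR_FILENO);
}
```

Because the record lives next to the data it describes, it is still available after a crash: whichever thread touched the queue last has left its call path behind, which is the information steps 7 to 10 rely on.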
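
The ordering requirement behind steps 11 and 12 can be modeled in user space with C11 atomics. This is a minimal sketch under stated assumptions, not kernel code: struct obj, ref_sub_and_test, obj_new, and obj_put are hypothetical names, and a C11 acquire fence stands in for the smp_rmb() call used in the case. Each decrement is a release operation, and the caller that brings the count to zero issues an acquire fence before freeing, so that the freeing thread observes everything other holders did before they dropped their references.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for an object protected by a counted reference. */
struct obj {
    atomic_uint refs;   /* plays the role of sk->sk_wmem_alloc */
    int payload;        /* data that holders may touch before dropping a reference */
};

/* Subtract n from the counter; return true only for the caller that
 * brings it to zero. */
bool ref_sub_and_test(atomic_uint *r, unsigned int n)
{
    /* Release: this thread's earlier accesses happen before the decrement. */
    unsigned int old = atomic_fetch_sub_explicit(r, n, memory_order_release);

    assert(old >= n);   /* dropping more than is held is a bug */
    if (old != n)
        return false;

    /* Acquire on success: the freeing thread sees the accesses of all
     * other holders before it tears the object down. */
    atomic_thread_fence(memory_order_acquire);
    return true;
}

/* Allocate an object holding `refs` references; returns NULL on failure. */
struct obj *obj_new(unsigned int refs)
{
    struct obj *o = malloc(sizeof(*o));

    if (o == NULL)
        return NULL;
    atomic_init(&o->refs, refs);
    o->payload = 0;
    return o;
}

/* Drop n references; returns true if this call freed the object. */
bool obj_put(struct obj *o, unsigned int n)
{
    if (!ref_sub_and_test(&o->refs, n))
        return false;
    free(o);
    return true;
}
```

Placing the fence inside ref_sub_and_test rather than in each caller mirrors the upstream approach mentioned in step 12: every user of the helper gets the ordering without having to remember a separate barrier.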