Case 1: System Breakdown Caused by a Deadlock

Symptom

A server breaks down occasionally when some software is running on it.

Fault Locating

  1. Configure kdump and enable KASAN, then try to reproduce the problem.
  2. If the problem does not recur for a long time with KASAN enabled, disable KASAN and try again.
  3. Confirm that a vmcore file is generated, then use the crash utility to debug it. As shown in the following figure, a blocked task exists.

  4. View the stack information. The kubelet process fails to acquire a lock and stays blocked, which indicates a deadlock.
  5. Check the log in the preceding figure. The __netlink_dump_start function calls mutex_lock, which blocks while waiting for the lock.
  6. View the __netlink_dump_start kernel source code.

    The cb_mutex member of the nlk structure is passed as the argument to mutex_lock.

  7. Check the offset of cb_mutex in the structure variable nlk.

    The offset of cb_mutex in the structure variable nlk is 920.

  8. Disassemble the __netlink_dump_start function.

    The disassembly contains the instruction ldr x0, [x0, #920]. Because the offset of cb_mutex in nlk is 920, this instruction loads the cb_mutex member from the base address of nlk. The base address of nlk is therefore in the x0 register before this instruction executes.

    The value of x0 at that point is not directly available. However, the preceding instruction mov x19, x0 copies it to x19. Because x19 is a callee-saved register, its value is saved on the stack when the subfunction mutex_lock is called.

  9. Disassemble the mutex_lock function. The instruction str x19, [sp,#16] shows that mutex_lock saves x19 at sp+16.

  10. Check the stack frame address of mutex_lock. The value is ffff00089a9efae0, as shown in the following figure. The saved x19, which holds the address of the nlk structure variable, is therefore stored at ffff00089a9efae0 + 0x10 = ffff00089a9efaf0.

  11. Read the value stored at ffff00089a9efaf0 to obtain the address of the nlk structure variable.
    crash> rd 0xffff00089a9efaf0
    ffff00089a9efaf0: ffffa05fc828f800
    
  12. Obtain the address of the lock that mutex_lock is waiting for.
    crash> struct netlink_sock.cb_mutex ffffa05fc828f800 -x
    cb_mutex = 0xffff000008e6e7b0 <rtnl_mutex>
    
  13. Check the mutex lock structure.
    crash> struct mutex 0xffff000008e6e7b0 -x

    Confirm that the lock holder is 0xffffa03b9dd80000. The lowest three bits of the owner field are flag bits rather than part of the lock holder's address, so they are cleared to 0 to obtain the address.

  14. View the ID of the process to which the lock holder belongs.

  15. View which other processes are requesting the lock.

  16. Check the related process code to determine the cause of the deadlock. Modify the code to avoid the deadlock, then recompile and run the program. The problem no longer occurs, and no further action is required.
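
The address arithmetic in steps 7 to 13 can be double-checked offline. The following Python sketch is only an illustration and is not part of the crash utility. The function names are made up for this example, and RAW_OWNER_EXAMPLE is a hypothetical raw owner value, because the procedure above shows only the holder address after the flag bits are cleared. All other constants are copied from the steps above.

```python
# Sketch of the address arithmetic used in steps 7 to 13.
# Constants come from the crash session in this case unless marked hypothetical.

CB_MUTEX_OFFSET = 920                   # step 7: offset of cb_mutex in nlk
X19_SAVE_OFFSET = 0x10                  # step 9: str x19, [sp,#16]
MUTEX_LOCK_FRAME = 0xFFFF00089A9EFAE0   # step 10: stack frame of mutex_lock
NLK_ADDRESS = 0xFFFFA05FC828F800        # step 11: value read from the stack slot
MUTEX_FLAG_MASK = 0x7                   # step 13: lowest three bits are flags

# Hypothetical raw owner field with one flag bit set (not taken from the vmcore).
RAW_OWNER_EXAMPLE = 0xFFFFA03B9DD80001


def saved_x19_slot(frame: int, offset: int = X19_SAVE_OFFSET) -> int:
    """Return the stack address where mutex_lock stored x19."""
    return frame + offset


def member_address(base: int, offset: int) -> int:
    """Return the address of a structure member at the given byte offset."""
    return base + offset


def owner_task(raw_owner: int, mask: int = MUTEX_FLAG_MASK) -> int:
    """Clear the flag bits of a mutex owner field to get the holder address."""
    return raw_owner & ~mask


if __name__ == "__main__":
    # Step 10: where the saved copy of the nlk address lives on the stack.
    print(hex(saved_x19_slot(MUTEX_LOCK_FRAME)))
    # Steps 7 and 8: the address that ldr x0, [x0, #920] reads from.
    print(hex(member_address(NLK_ADDRESS, CB_MUTEX_OFFSET)))
    # Step 13: holder address with the flag bits cleared.
    print(hex(owner_task(RAW_OWNER_EXAMPLE)))
```

With the frame address from step 10, saved_x19_slot returns the same address that step 11 passes to rd. Keeping the mask as a named constant makes it easy to adjust if the kernel in use reserves a different number of flag bits.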