NameNode Service Forcibly Stopped Occasionally Under Concurrent Spark Tasks

Symptom

In tests of concurrent Spark tasks, when the submission rate reaches 1,800 tasks per hour, tasks begin to accumulate in the system. During this period, the NameNode service is occasionally stopped forcibly.

Key Process and Cause Analysis

  1. Check the NameNode logs. They contain no obvious exception information.
  2. Run the dmesg command to check the system kernel logs. The logs show that the system stopped the NameNode process because system memory was insufficient.

    Analysis of the tasks running in the cluster shows that tasks are submitted much faster than they are executed. As a result, tasks that have been submitted to Spark on YARN but not yet executed accumulate. As the backlog grows, memory consumption on the Driver node increases sharply. This triggers the Linux memory protection mechanism (the OOM killer), which stops the NameNode process because it consumes a large amount of memory, thereby preventing a system crash.
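The kernel-log check in step 2 can be sketched as a small script that scans dmesg output for processes stopped by the Linux out-of-memory (OOM) killer. This is a minimal sketch, not the procedure used in this case: the function name find_oom_kills, the regular expression, and the sample log line are all assumptions made for illustration, and the exact wording of kernel messages varies by kernel version.

```python
import re
from typing import List, Tuple

# Matches lines such as "Out of memory: Killed process 4321 (java) ..." and the
# older "Killed process 4321 (java) ..." form. The exact wording depends on the
# kernel version, so treat this pattern as an assumption.
OOM_KILL = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(kernel_log: str) -> List[Tuple[int, str]]:
    """Return (pid, command name) for every OOM kill found in the log text."""
    return [(int(m.group(1)), m.group(2)) for m in OOM_KILL.finditer(kernel_log)]


# Illustrative sample only, not output captured from the cluster in this case.
sample = (
    "[12345.678901] Out of memory: Killed process 4321 (java) "
    "total-vm:8388608kB, anon-rss:6291456kB, file-rss:0kB\n"
    "[12346.000000] eth0: link up\n"
)

if __name__ == "__main__":
    print(find_oom_kills(sample))
```

To scan a real host, the text could come from subprocess.run(["dmesg"], capture_output=True, text=True).stdout, which may require root privileges. Because the NameNode runs in a JVM, the command name in the log is typically java, so compare the reported PID with the PID of the NameNode process.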

Conclusion and Solution

In high-concurrency scenarios, tasks are submitted much faster than they are executed. As a result, tasks accumulate, the memory of the Driver node is exhausted, and the system stops the NameNode process.
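The accumulation described above can be estimated with simple arithmetic: while submissions outpace completions, the backlog grows by the difference between the two rates. The sketch below assumes constant rates and an empty queue at the start. The function name backlog_after is hypothetical, and the completion rate of 1,200 tasks per hour is a made-up figure for illustration; only the 1,800 tasks per hour submission rate comes from the test.

```python
def backlog_after(hours: float, submit_per_hour: float, complete_per_hour: float) -> float:
    """Pending tasks after `hours`, assuming constant rates and an empty start."""
    # The backlog cannot go below zero when completions outpace submissions.
    return max(0.0, float(submit_per_hour - complete_per_hour) * hours)


if __name__ == "__main__":
    # 1,800 tasks/hour is the submission rate from the test; 1,200 tasks/hour
    # is a hypothetical completion rate chosen only for illustration.
    print(backlog_after(2, 1800, 1200))
```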

To avoid this problem, adjust the task submission policy to reduce concurrency so that the task submission rate matches the processing capacity of the cluster.
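One way to reduce concurrency is to cap the number of tasks that are in flight at any moment, so that a new task is submitted only when an earlier one finishes. The following is a minimal sketch under stated assumptions, not the submission policy used in this case: run_with_limit and submit_and_wait are hypothetical names, and submit_and_wait stands for any callable that submits one Spark task and blocks until it completes.

```python
from concurrent.futures import ThreadPoolExecutor


def run_with_limit(jobs, submit_and_wait, max_in_flight):
    """Run every job while keeping at most `max_in_flight` submitted at once.

    `submit_and_wait` is a hypothetical callable that submits one Spark task
    and blocks until it finishes (for example, a wrapper that launches
    spark-submit and waits for it to exit).
    """
    # Each worker thread handles one job at a time, so the pool size is the
    # upper bound on concurrently submitted tasks.
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        # map() returns results in the same order as `jobs`.
        return list(pool.map(submit_and_wait, jobs))
```

A suitable value for max_in_flight depends on how many tasks the cluster can execute concurrently; start low and raise it while watching memory usage on the Driver node.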