NameNode Service Occasionally Forcibly Stopped Under Concurrent Spark Tasks
Symptom
In a concurrency test of Spark tasks, tasks begin to accumulate in the system once the submission rate reaches 1,800 tasks per hour. While the backlog persists, the NameNode service is occasionally stopped forcibly.
Key Process and Cause Analysis
- Check the NameNode logs. No obvious exception information is found.
- Run the dmesg command to check the kernel logs. They show that the system killed the NameNode process because system memory was insufficient.
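The kernel-log check above can be sketched as a small filter over dmesg output. This is a minimal sketch, not the procedure used in this case: the function name find_oom_kills and the sample log are hypothetical, and the pattern assumes the common "Killed process <pid> (<name>)" wording that Linux kernels print when a process is killed for lack of memory.

```python
import re
from typing import List, Tuple

# Assumed wording of the kernel's kill record: "Killed process <pid> (<name>)".
KILLED_RE = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(kernel_log: str) -> List[Tuple[int, str]]:
    """Return (pid, command name) for each 'Killed process' line in the log text."""
    victims = []
    for line in kernel_log.splitlines():
        match = KILLED_RE.search(line)
        if match:
            victims.append((int(match.group(1)), match.group(2)))
    return victims


# Illustrative sample only; it is not taken from the cluster in this case.
SAMPLE = (
    "[12345.678901] Out of memory: Killed process 4321 (java) "
    "total-vm:8388608kB, anon-rss:4194304kB\n"
    "[12346.000000] eth0: link up\n"
)

if __name__ == "__main__":
    for pid, name in find_oom_kills(SAMPLE):
        print(pid, name)
```

Because the NameNode is a Java process, the kernel reports its command name as java, so the PID in the log still has to be compared with the NameNode PID recorded before the failure.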
Analysis of the task flow in the cluster shows that tasks are submitted much faster than they are executed. As a result, tasks that have been submitted to Spark on YARN but not yet executed accumulate. As the backlog grows, memory consumption on the Driver node rises sharply until system memory runs short. This triggers the Linux out-of-memory (OOM) killer, which protects the system from crashing by killing a process that consumes a large amount of memory. Here, that process is the NameNode.
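The accumulation described above is simple arithmetic: the backlog grows by the difference between the submission rate and the completion rate. The sketch below assumes constant rates and an empty queue at the start. The submission rate of 1,800 tasks per hour comes from the symptom description; the completion rate of 1,500 tasks per hour is a hypothetical figure chosen only for illustration, because the case does not report one.

```python
def backlog_after(hours: float, submitted_per_hour: float,
                  completed_per_hour: float) -> float:
    """Tasks still waiting after `hours`, assuming constant rates and an empty start."""
    growth_per_hour = submitted_per_hour - completed_per_hour
    return max(0.0, float(growth_per_hour * hours))


SUBMIT_RATE = 1800    # tasks per hour, from the symptom description
COMPLETE_RATE = 1500  # tasks per hour, hypothetical: the case gives no figure

if __name__ == "__main__":
    for hours in (1, 2, 4):
        print(hours, backlog_after(hours, SUBMIT_RATE, COMPLETE_RATE))
```

Under these assumptions the backlog never shrinks on its own, which is consistent with the steadily rising memory consumption observed on the Driver node.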
Conclusion and Solution
In high-concurrency scenarios, the imbalance between task submission and completion leads to task accumulation, which in turn exhausts the memory of the Driver node and ultimately causes the NameNode process to be terminated by the system.
To avoid this problem, adjust the task submission policy to lower the concurrency so that the task submission rate does not exceed the cluster's processing capacity.
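One possible way to apply such a submission policy on the client side is to cap the number of tasks in flight, so that the submitter blocks instead of building a backlog. The following is a minimal sketch under stated assumptions, not the policy used in this case: ThrottledSubmitter, run_task, and demo are hypothetical names, and run_task stands in for whatever actually launches a Spark task in a given environment.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ThrottledSubmitter:
    """Caps the number of tasks in flight; submit() blocks once the cap is reached."""

    def __init__(self, max_in_flight: int):
        self._slots = threading.BoundedSemaphore(max_in_flight)
        self._pool = ThreadPoolExecutor(max_workers=max_in_flight)

    def submit(self, run_task, *args):
        # Block the caller here rather than letting unexecuted tasks pile up.
        self._slots.acquire()
        try:
            future = self._pool.submit(run_task, *args)
        except BaseException:
            self._slots.release()
            raise
        # Free the slot when the task finishes, whether it succeeded or failed.
        future.add_done_callback(lambda _f: self._slots.release())
        return future

    def shutdown(self):
        self._pool.shutdown(wait=True)


def demo(max_in_flight: int = 2, total: int = 6):
    """Run `total` placeholder tasks and report the peak number running at once."""
    lock = threading.Lock()
    state = {"running": 0, "peak": 0}

    def fake_task(i):  # hypothetical stand-in for launching a real Spark task
        with lock:
            state["running"] += 1
            state["peak"] = max(state["peak"], state["running"])
        time.sleep(0.01)
        with lock:
            state["running"] -= 1
        return i

    submitter = ThrottledSubmitter(max_in_flight)
    futures = [submitter.submit(fake_task, i) for i in range(total)]
    results = [f.result() for f in futures]
    submitter.shutdown()
    return state["peak"], results


if __name__ == "__main__":
    print(demo())
```

The cap itself has to be chosen from the measured processing capacity of the cluster; the value 2 in the demo is arbitrary.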