Node Exception Caused by Excessive Memory Usage
Symptom
When the spark-submit script is executed to submit a task that tests the native LDA algorithm on a dataset generated by HiBench (20 million samples with 200,000 feature dimensions), the connection to the node fails.
Key Process and Cause Analysis
The node has 384 GB of memory in total, of which 364 GB is free. The LDA algorithm task occupies up to 363.1 GB of memory, leaving only 0.9 GB free, a usage rate of 99.77%. With almost no memory left, the OS stops responding.

To improve test performance, resource parameters such as num-executors and executor-memory were configured so that the algorithm could occupy 99% of the system memory. When the test dataset is large, for example, when the native LDA algorithm runs on the D20M200K dataset, the task consumes nearly all of that memory. As a result, the OS stops responding.
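The sizing logic above can be sketched as a small calculation. The helper below is hypothetical (not part of Spark or HiBench); it assumes Spark adds roughly 10% off-heap overhead on top of executor-memory, which matches Spark's default memoryOverhead factor, and it reserves a fixed amount of memory for the OS so that executors cannot exhaust the node:

```python
def plan_executor_memory(total_gb, num_executors, os_reserve_gb=32, overhead_frac=0.10):
    """Split node memory across executors while keeping an OS reserve.

    Hypothetical sizing helper: Spark adds roughly ``overhead_frac`` of
    off-heap overhead on top of each executor's heap (executor-memory),
    so the heap plus overhead must fit into each executor's share.
    """
    usable_gb = total_gb - os_reserve_gb            # memory Spark may use
    per_executor_share = usable_gb / num_executors  # total share per executor
    # Heap size such that heap * (1 + overhead) stays within the share
    return int(per_executor_share / (1 + overhead_frac))

# Example: a 384 GB node with 12 executors and a 32 GB OS reserve
heap_gb = plan_executor_memory(384, 12)
print(heap_gb)  # per-executor heap size in GB
```

With these assumed numbers, the 12 executors together consume about 343 GB including overhead, leaving roughly 40 GB for the OS instead of the 0.9 GB observed in the failure above.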
Conclusion and Solution
Unless you are running extreme performance tests, take system stability into consideration. You are advised to configure Spark task resource parameters based on the cluster resources actually available, and not to let tasks consume all system memory.
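As an illustration, a submission that caps executor memory below the node's capacity might look like the following. The executor count, memory sizes, and core count are hypothetical placeholders and must be tuned to the actual cluster; the class and JAR placeholders stand in for the real application:

```shell
# Hypothetical sizing for a 384 GB node: 12 executors x 26 GB heap
# (~343 GB including the default ~10% executor memoryOverhead),
# leaving roughly 40 GB for the OS and other processes.
spark-submit \
  --master yarn \
  --num-executors 12 \
  --executor-memory 26g \
  --executor-cores 4 \
  --class <main-class> \
  <application-jar>
```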