OneForOneBlockFetcher Errors Occasionally Occur When a 10 TB Dataset Is Run on Spark 3.1.1
Symptom
When a job is run on a large dataset (about 10 TB) in Spark 3.1.1 and spark.network.timeout is set to a small value, data fetches may time out during the shuffle phase, triggering OneForOneBlockFetcher errors and producing inconsistent results.
Key Process and Cause Analysis
The default value of spark.network.timeout is 120s, which can be too short for a 10 TB dataset. If a fetch exception (for example, a timeout) occurs during the shuffle phase, the data is fetched again. During this retry, incorrect block IDs can cause the wrong data to be fetched, so the final results may be inconsistent. This is a bug in the Spark community code and has been fixed in https://github.com/apache/spark/pull/31643.
Conclusion and Solution
Set spark.network.timeout to a larger value to prevent data fetch timeouts during shuffle. A value of 600s is recommended.
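As a sketch, the recommended timeout can be passed to a job at submission time with --conf; the application name, class, and JAR path below are placeholders, not values from this case:

```shell
# Raise the network timeout from the 120s default to 600s for this job.
# Replace the class and JAR with your own application's values.
spark-submit \
  --conf spark.network.timeout=600s \
  --class com.example.MyApp \
  my-app.jar
```

Alternatively, the same setting can be made persistent for all jobs by adding `spark.network.timeout 600s` to spark-defaults.conf.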