OneForOneBlockFetcher Errors Occasionally Occur When a 10 TB Dataset Is Run on Spark 3.1.1

Symptom

In Spark 3.1.1, when a large dataset (about 10 TB) is processed and spark.network.timeout is set to a small value, data fetches may time out during the shuffle phase, triggering OneForOneBlockFetcher errors and producing inconsistent results.

Key Process and Cause Analysis

For a 10 TB dataset, the default value of spark.network.timeout (120s) may be too short. During the shuffle phase, if a fetch exception occurs, for example a timeout, the data is fetched again. On this retry, incorrect block IDs can cause the wrong data to be fetched, so data inconsistency may occur. This is a bug in the Spark community code and has been fixed in https://github.com/apache/spark/pull/31643.

Conclusion and Solution

Set spark.network.timeout to a larger value to prevent data fetch timeouts. A value of 600s is recommended.
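For example, the timeout can be raised when the job is submitted. This is a sketch: the class name and JAR file below are placeholders for your own application.

```shell
# Raise the network timeout to 600 seconds for this job only.
# com.example.MyApp and my-app.jar are hypothetical placeholders.
spark-submit \
  --class com.example.MyApp \
  --conf spark.network.timeout=600s \
  my-app.jar

# Alternatively, set it cluster-wide in conf/spark-defaults.conf:
# spark.network.timeout  600s
```

Setting the value per job via --conf avoids changing the cluster-wide default, which may be preferable when only large datasets hit the timeout.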