OneForOneBlockFetcher Errors Occasionally Occur When a 10 TB Dataset Is Run on Spark 3.1.1
Symptom
When a job is run on a large dataset (about 10 TB) in Spark 3.1.1 and spark.network.timeout is set to a small value, data fetches may time out during the shuffle phase, triggering OneForOneBlockFetcher errors and producing inconsistent results.
Key Process and Cause Analysis
The default value of spark.network.timeout is 120s, which can be too short for a 10 TB dataset. If a fetch exception (for example, a timeout) occurs during the shuffle phase, the data is fetched again. During this retry, incorrect block IDs can cause the wrong data to be fetched, so the final results may be inconsistent. This is a bug in the Spark community code and has been fixed in https://github.com/apache/spark/pull/31643.
Conclusion and Solution
Set spark.network.timeout to a larger value to prevent data fetch timeouts during shuffle. A value of 600s is recommended.
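As a sketch, the recommended timeout can be passed to a job at submission time with --conf; the application name, class, and JAR path below are placeholders, not values from this case:

```shell
# Raise the network timeout from the 120s default to 600s for this job.
# Replace the class and JAR with your own application's values.
spark-submit \
  --conf spark.network.timeout=600s \
  --class com.example.MyApp \
  my-app.jar
```

Alternatively, the same setting can be made persistent for all jobs by adding `spark.network.timeout 600s` to spark-defaults.conf.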