Application Scenarios
OmniShuffle accelerates shuffle-intensive jobs, big data workloads, and data analytics scenarios, and supports Spark 3.1.1 and Spark 3.3.1.
Shuffle-intensive Job Scenarios
Even after OmniOperator is enabled, shuffle data is still written to drives, and shuffle-intensive jobs must exchange a large amount of data across nodes once the Map process is complete. Combining OmniShuffle with OmniOperator therefore brings additional performance benefits.
Big Data Scenarios
The big data engine Spark is used to perform shuffle-intensive jobs. After the Map process is complete, a large amount of data needs to be exchanged across nodes. Statistics show that the Spark shuffle process accounts for the largest share of time and resource overhead in many analytics scenarios, and even 50% to 80% of the end-to-end time of Spark services in some scenarios.
As a performance acceleration component of Spark, OmniShuffle uses the plugin mechanism provided by Spark to implement the Shuffle Manager and Broadcast Manager plugin interfaces, replacing the shuffle and broadcast implementations of open-source Spark in a non-intrusive manner.
- OmniShuffle implements the Shuffle Manager plugin interface to enable in-memory shuffle: the shuffle process is completed in a memory pool based on memory semantics, reducing the flushing of shuffle data to drives. This lessens the time and compute overhead of data flushing and reading, serialization and deserialization, and compression and decompression.
- OmniShuffle implements the Broadcast Manager plugin interface to broadcast variables through a shared memory pool, improving the transmission efficiency of broadcast variables among executors.
In addition, OmniShuffle supports two network modes: Remote Direct Memory Access (RDMA) and TCP. Compared with TCP, RDMA offers higher transmission efficiency and lower CPU overhead, enabling efficient data exchange between nodes.
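As a sketch of how such a plugin is wired in, a custom shuffle manager is typically selected through Spark configuration. The `spark.shuffle.manager` key is a standard Spark setting; the class name below is a placeholder, not OmniShuffle's actual class name:

```properties
# spark-defaults.conf (illustrative)
# spark.shuffle.manager is a standard Spark configuration key; the class name
# here is a placeholder, since the actual OmniShuffle class is product-specific.
spark.shuffle.manager=com.example.OmniShuffleManager
```

Because the replacement happens at the configuration layer, application code and SQL queries run unchanged, which is what makes the approach non-intrusive.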
Data Analytics Scenarios
Traditional data analytics faces the following challenges:
- Data analytics costs have been increasing due to soaring data volumes and heterogeneous data types.
- Traditional shuffle architectures strongly depend on local storage and do not separate storage from compute. As a result, Reduce nodes have a lot of fragmented drive reads and writes, which deteriorate the overall read and write efficiency.
- Traditional shuffle processes have poor reliability. In large-scale service scenarios, shuffle data may be lost when a drive is faulty, causing stage recalculation. If there are too many network connections, fetching shuffle data over the network may fail.
The Remote Shuffle Service (RSS) works on a decoupled storage-compute architecture. By optimizing the Spark shuffle write process, BoostRSS saves the data generated in the Map stage to RSS nodes. In this way, the original small files and small I/O operations are aggregated into efficient large files and continuous large I/O operations. RSS significantly improves the overall drive read and write efficiency and alleviates the I/O burden on compute nodes, thereby improving MapReduce task execution performance.
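The aggregation idea can be illustrated with a minimal sketch (not OmniShuffle's actual code): many small map-task outputs are concatenated into one large file plus an offset index, so a reducer issues one large sequential read instead of many small random reads. All names here are hypothetical.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch of shuffle-output aggregation: small per-map-task
// blocks are appended to one large file, and an index records where each
// block lives, so reads become large and sequential.
public class ShuffleAggregation {

    // Concatenate all small outputs into bigFile; return {offset, length}
    // per map-task id so each block can be located later.
    public static Map<String, long[]> aggregate(Map<String, byte[]> smallOutputs,
                                                Path bigFile) throws IOException {
        Map<String, long[]> index = new LinkedHashMap<>();
        long offset = 0;
        try (OutputStream out = Files.newOutputStream(bigFile)) {
            for (Map.Entry<String, byte[]> e : smallOutputs.entrySet()) {
                out.write(e.getValue());
                index.put(e.getKey(), new long[]{offset, e.getValue().length});
                offset += e.getValue().length;
            }
        }
        return index;
    }

    // Read back a single map task's block using its index entry.
    public static byte[] read(Path bigFile, long[] entry) throws IOException {
        byte[] all = Files.readAllBytes(bigFile);
        return Arrays.copyOfRange(all, (int) entry[0], (int) (entry[0] + entry[1]));
    }
}
```

A real RSS additionally handles replication, node failures, and sorting by reduce partition; this sketch only shows why aggregation turns fragmented I/O into continuous I/O.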
The following modes are recommended for different deployments:
- ESS mode: The service scale is small or the server specifications are sufficiently high.
- RSS mode: The service scale is large and compute nodes have limited resources.
- You are advised to use an RDMA network and ensure high network bandwidth for desired performance.
- Spark has a plugin mechanism. You can replace the original functions of Spark by implementing the Spark plugin interface.
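For ESS mode, Spark's built-in external shuffle service is enabled with standard Spark settings, sketched below. The keys shown are standard Spark configuration; any RSS-specific keys are product-specific and are intentionally not shown:

```properties
# spark-defaults.conf — standard Spark keys for ESS mode (illustrative).
spark.shuffle.service.enabled=true
# With dynamic allocation, ESS is required so executors can be released
# without losing the shuffle files they wrote.
spark.dynamicAllocation.enabled=true
```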