Application Scenarios
Learn about the OmniShuffle application scenarios before using the feature.
Even with OmniOperator enabled, shuffle data is still written to drives, and shuffle-intensive jobs still exchange large amounts of data across nodes after the map phase completes. Combining OmniShuffle with OmniOperator therefore brings additional performance benefits, especially for shuffle-intensive jobs.
In big data scenarios, Spark is widely used to run shuffle-intensive jobs. After the map phase completes, a large amount of data must be exchanged across nodes. Statistics show that the Spark shuffle process accounts for the largest share of time and resource overhead in many analysis scenarios, and in some cases for 50% to 80% of the end-to-end time of Spark services.
As a performance acceleration component of Spark, OmniShuffle uses Spark's plugin mechanism to implement the Shuffle Manager and Broadcast Manager plugin interfaces, replacing native Spark shuffle and broadcast in a non-intrusive manner. By implementing the Shuffle Manager plugin interface, OmniShuffle enables in-memory shuffle: the shuffle process is completed in a memory pool based on memory semantics, reducing the flushing of shuffle data to drives and, with it, the time and compute overhead of disk writes and reads, serialization and deserialization, and compression and decompression. By implementing the Broadcast Manager interface, OmniShuffle enables variable broadcast based on memory pool sharing, improving the transmission efficiency of broadcast variables among executors. OmniShuffle also supports two network modes: Remote Direct Memory Access (RDMA) and TCP. Compared with TCP, RDMA improves transmission efficiency, consumes less computing power, and enables efficient data exchange between nodes.
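For illustration, a shuffle plugin of this kind is typically enabled purely through Spark configuration, which is what makes the replacement non-intrusive. The sketch below uses Spark's standard spark.shuffle.manager configuration key; the class name and jar path are hypothetical placeholders, not the actual OmniShuffle values, so consult the OmniShuffle deployment guide for the real settings.

```shell
# Sketch: enabling a custom shuffle manager without modifying the application.
# The class name and jar path below are hypothetical placeholders.
spark-submit \
  --conf spark.shuffle.manager=com.example.shuffle.CustomShuffleManager \
  --conf spark.driver.extraClassPath=/opt/plugin/shuffle-plugin.jar \
  --conf spark.executor.extraClassPath=/opt/plugin/shuffle-plugin.jar \
  your_job.py
```

Because only configuration changes are involved, the application code and SQL statements remain untouched, and the plugin can be rolled back by simply removing the configuration items.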
In addition, OCK BoostTuning for Spark SQL automatically adjusts the degree of parallelism of Spark SQL jobs in real time based on historical data, eliminating the need to tune parallelism manually and reducing spills in the shuffle-reduce phase by 90%. As a result, OCK BoostTuning accelerates big data cluster jobs and increases job throughput.
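To make the manual tuning that BoostTuning replaces concrete, the native Spark knobs involved are shown below. These are standard Spark SQL configuration keys, not BoostTuning parameters; the partition count is an arbitrary illustrative value that would otherwise have to be chosen per workload.

```shell
# Without automatic tuning, shuffle parallelism for Spark SQL is set by hand,
# e.g. via spark.sql.shuffle.partitions (the value 2000 is illustrative only).
spark-submit \
  --conf spark.sql.shuffle.partitions=2000 \
  --conf spark.sql.adaptive.enabled=true \
  your_job.py
```

A value that is too low causes large partitions and spills in the reduce phase, while a value that is too high creates scheduling overhead, which is why adjusting it automatically from historical data is valuable.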
Spark provides a plugin mechanism: by implementing the Spark plugin interfaces, you can replace native Spark functionality without modifying Spark itself.
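As a general sketch of this mechanism, Spark 3.x loads driver and executor plugins listed in the standard spark.plugins configuration key, while specific subsystems such as shuffle are replaced through their own dedicated keys. The plugin class name below is a hypothetical placeholder for illustration.

```shell
# Sketch: Spark's generic plugin loading mechanism (Spark 3.x).
# com.example.MySparkPlugin is a hypothetical placeholder class that would
# implement the org.apache.spark.api.plugin.SparkPlugin interface.
spark-submit \
  --conf spark.plugins=com.example.MySparkPlugin \
  your_job.py
```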