OmniShuffle
OmniShuffle is a performance acceleration component for Spark. It uses the plugin mechanism provided by Spark to implement the Shuffle Manager and Broadcast Manager plugin interfaces, replacing the native shuffle and broadcast of Spark in a non-intrusive manner.
By implementing the Shuffle Manager plugin interface, OmniShuffle enables in-memory shuffle: the shuffle process is completed in a memory pool based on memory semantics, so less shuffle data is flushed to disk. This reduces the time and computing overhead of writing and reading data, serialization and deserialization, and compression and decompression. By implementing the Broadcast Manager interface, OmniShuffle broadcasts variables through memory pool sharing, which improves the transmission efficiency of broadcast variables among executors.
OmniShuffle supports two network modes: Remote Direct Memory Access (RDMA) and TCP. Compared with TCP, RDMA exchanges data between nodes more efficiently and consumes less computing power.
OmniShuffle also automatically adjusts the degree of parallelism of Spark SQL jobs in real time based on historical data. This eliminates the need to tune the degree of parallelism manually and reduces spills in the shuffle-reduce process by 90%. As a result, OmniShuffle speeds up big data cluster jobs and increases job throughput.
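To make the idea of choosing parallelism from historical data concrete, the following is a minimal sketch. It is not the algorithm used by OmniShuffle, which this document does not describe. The function name, the 128 MiB target partition size, and the partition bounds are all assumptions chosen for illustration. The resulting value corresponds to the kind of number a user would otherwise set by hand through the Spark SQL setting spark.sql.shuffle.partitions.

```python
import math

# Assumption: aim for roughly 128 MiB of shuffle data per reduce partition.
DEFAULT_TARGET_PARTITION_BYTES = 128 * 1024 * 1024


def suggest_shuffle_partitions(
    historical_shuffle_bytes,
    target_partition_bytes=DEFAULT_TARGET_PARTITION_BYTES,
    min_partitions=1,
    max_partitions=10000,
):
    """Suggest a shuffle partition count from shuffle sizes seen in past runs.

    Hypothetical heuristic: size the partitions for the largest shuffle
    observed so far, so that each partition stays near the target size.
    Returns None when there is no history to learn from.
    """
    if not historical_shuffle_bytes:
        return None
    peak_bytes = max(historical_shuffle_bytes)
    wanted = math.ceil(peak_bytes / target_partition_bytes)
    # Clamp to the configured bounds.
    return max(min_partitions, min(max_partitions, wanted))


if __name__ == "__main__":
    gib = 1024 ** 3
    history = [10 * gib, 12 * gib]  # shuffle sizes of two earlier runs
    print(suggest_shuffle_partitions(history))
```

Sizing for the peak rather than the average is a conservative choice: smaller partitions are less likely to exceed reducer memory, which is the condition that causes spills.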
Overall Solution
Figure 1 shows the overall solution architecture of OmniShuffle, and Table 1 describes its subsystems.
Service Process
OmniShuffle uses the Spark Shuffle Manager interface to implement in-memory shuffle. Figure 2 shows the OmniShuffle service process.
After connecting OmniShuffle to Spark, you can access the Spark CLI to view the cluster running status.
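Spark selects its shuffle implementation through the spark.shuffle.manager configuration property, which is the usual way to plug in a custom Shuffle Manager. The sketch below builds a spark-sql command line that sets this property. The class name is a hypothetical placeholder, because the actual OmniShuffle class name and any additional required settings are not given in this document. The helper function is also illustrative only.

```python
import shlex

# Hypothetical placeholder: replace with the shuffle manager class name
# given in the OmniShuffle installation instructions.
SHUFFLE_MANAGER_CLASS = "com.example.shuffle.InMemoryShuffleManager"


def build_spark_sql_command(shuffle_manager, extra_conf=None):
    """Return a spark-sql argument list that selects a custom shuffle manager."""
    conf = {"spark.shuffle.manager": shuffle_manager}
    conf.update(extra_conf or {})
    args = ["spark-sql"]
    # Emit one "--conf key=value" pair per setting, in a stable order.
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Print a shell-quoted command line for copy and paste.
    print(shlex.join(build_spark_sql_command(SHUFFLE_MANAGER_CLASS)))
```

Because the shuffle manager is chosen through configuration rather than code changes, existing Spark SQL jobs can run unchanged, which is consistent with the non-intrusive replacement described earlier.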

