OmniShuffle

OmniShuffle is a performance acceleration component for Spark. It uses Spark's plugin mechanism to implement the Shuffle Manager and Broadcast Manager plugin interfaces, replacing Spark's native shuffle and broadcast in a non-intrusive manner. Through the Shuffle Manager interface, OmniShuffle enables in-memory shuffle: the shuffle process is completed in a memory pool based on memory semantics, so less shuffle data is flushed to disk. This reduces the time and computing power spent on writing and reading data, serialization and deserialization, and compression and decompression. Through the Broadcast Manager interface, OmniShuffle broadcasts variables by sharing them in the memory pool, which improves the transmission efficiency of broadcast variables among executors. OmniShuffle supports two network modes: Remote Direct Memory Access (RDMA) and TCP. Compared with TCP, RDMA improves transmission efficiency, requires less computing power, and enables efficient data exchange between nodes.
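As a rough illustration of the plugin mechanism, Spark selects its shuffle implementation through the `spark.shuffle.manager` configuration property. The sketch below assembles `spark-submit` arguments that point this property at a plugin class. The class name `com.example.shuffle.InMemoryShuffleManager`, the JAR path, and the helper `build_submit_args` are hypothetical placeholders, not OmniShuffle's actual values; the real ones should be taken from the OmniShuffle installation guide.

```python
from typing import Dict, List, Optional

# Hypothetical placeholders for illustration only. The real plugin class
# and JAR location come from the OmniShuffle installation guide.
PLUGIN_CLASS = "com.example.shuffle.InMemoryShuffleManager"
PLUGIN_JAR = "/opt/example/shuffle-plugin.jar"


def build_submit_args(app: str, extra_conf: Optional[Dict[str, str]] = None) -> List[str]:
    """Return spark-submit arguments that swap in a custom shuffle manager."""
    conf = {
        # Standard Spark property that names the ShuffleManager implementation.
        "spark.shuffle.manager": PLUGIN_CLASS,
        # Make the plugin classes visible to the driver and the executors.
        "spark.driver.extraClassPath": PLUGIN_JAR,
        "spark.executor.extraClassPath": PLUGIN_JAR,
    }
    conf.update(extra_conf or {})

    args = ["spark-submit"]
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    args.append(app)
    return args


if __name__ == "__main__":
    print(" ".join(build_submit_args("job.py")))
```

Because the replacement is made purely through configuration, the application code itself does not change, which is what makes the approach non-intrusive.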

OmniShuffle also automatically adjusts the degree of parallelism of Spark SQL jobs in real time based on historical data. This eliminates the need to tune parallelism manually and reduces spills in the shuffle-reduce process by 90%. As a result, OmniShuffle speeds up big data cluster jobs and increases job throughput.
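The source does not describe how the parallelism adjustment is computed. Purely to illustrate the general idea of deriving parallelism from historical data, the sketch below sizes the partition count from the largest shuffle volume seen in earlier runs. The function name, the 128 MiB target, and the clamping bounds are assumptions made for this example and are not OmniShuffle's algorithm.

```python
import math
from typing import Sequence


def suggest_shuffle_partitions(
    past_shuffle_bytes: Sequence[int],
    target_partition_bytes: int = 128 * 1024 * 1024,  # assumed 128 MiB target
    min_partitions: int = 1,
    max_partitions: int = 10_000,
) -> int:
    """Suggest a partition count from shuffle sizes observed in earlier runs."""
    if not past_shuffle_bytes:
        raise ValueError("need at least one historical run")
    if target_partition_bytes <= 0:
        raise ValueError("target_partition_bytes must be positive")

    # Size for the largest run observed, so that a heavy run is less
    # likely to overflow reducer memory and spill.
    peak = max(past_shuffle_bytes)
    wanted = math.ceil(peak / target_partition_bytes)

    # Clamp the result to the allowed range.
    return max(min_partitions, min(max_partitions, wanted))


if __name__ == "__main__":
    gib = 1024 ** 3
    print(suggest_shuffle_partitions([10 * gib, 12 * gib]))
```

In a manual setup, a value like this would be supplied through the standard `spark.sql.shuffle.partitions` property; the point of the automatic adjustment described above is that the user no longer has to do so.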

Overall Solution

OmniShuffle enables in-memory shuffle by implementing the Shuffle Manager plugin interface. Figure 1 shows the overall solution architecture of OmniShuffle. Table 1 describes the subsystems.

Figure 1 Logical architecture of OmniShuffle

**Table 1** Subsystems
| Subsystem | Description |
| --- | --- |
| Memory pool kit | Provides the distributed shared memory infrastructure and basic memory semantics. |

Service Process

OmniShuffle uses the Spark Shuffle Manager interface to implement in-memory shuffle. Figure 2 shows the OmniShuffle service process.

Figure 2 Service process

After connecting OmniShuffle to Spark, you can access the Spark CLI to view the cluster running status.
