Feature Description

OmniRuntime delivers its big data features as plugins that improve the end-to-end performance of data loading, computing, and exchange.

Data volumes generated from Internet services have been growing much faster than CPUs' computing power. The open-source big data ecosystem is also developing on a fast track. However, diversified computing engines and open source components make it difficult to improve data processing performance throughout the lifecycle. Different big data engines use their own unique tuning policies and technologies to improve performance and efficiency. Some tuning items may be applied across multiple engines, which may cause resource contention and conflicts, reducing overall computing performance.

OmniRuntime is the set of application acceleration features provided by Kunpeng BoostKit for Big Data. It uses plugins to improve the end-to-end performance of data loading, computing, and exchange, thereby improving the performance of big data analytics.

Table 1 lists the open source components and versions to which each subfeature of OmniRuntime has been adapted.

Table 1 OmniRuntime subfeatures and open source components

Subfeature Name

Description

Open Source Component and Version

OmniOperator

OmniOperator implements big data SQL operators in native code (C/C++) to improve query performance. It leverages columnar storage and vectorized execution technologies as well as Kunpeng vectorized instructions, and replaces open source Java operators with high-performance native operators to improve operator execution efficiency and query engine performance.

  • Spark 3.1.1
  • Spark 3.3.1
  • Spark 3.4.3
  • Spark 3.5.2
  • Hive 3.1.0
  • openLooKeng 1.6.1
  • Gluten 1.3

OmniShuffle

OmniShuffle is a performance acceleration component for the Spark big data engine. It runs in the big data clusters of the customer's data center and uses unified addressing of the memory pool, data exchange in memory semantics, and converged shuffle to reduce drive I/O overhead, speed up data analysis, and improve cluster resource utilization.

OmniShuffle implements the Shuffle Manager and Broadcast Manager plugin interfaces that Spark provides, so it replaces the open source Spark Shuffle and Broadcast implementations in a non-intrusive manner.

  • Spark 3.1.1
  • Spark 3.3.1
  • Hive 3.1.0
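Spark selects its shuffle implementation through the `spark.shuffle.manager` setting, which is why a plugin of this kind can be enabled without changing application code. The following minimal sketch only builds the `spark-submit` arguments for such a swap. The class name, JAR path, and application JAR are hypothetical placeholders and are not the actual OmniShuffle values, which are not stated in this section.

```python
def build_submit_args(app_jar, shuffle_manager_class, plugin_jar):
    """Build spark-submit arguments that select a custom shuffle manager.

    All three arguments are illustrative placeholders supplied by the caller.
    """
    return [
        "spark-submit",
        # Make the plugin classes visible to the driver and the executors.
        "--conf", f"spark.driver.extraClassPath={plugin_jar}",
        "--conf", f"spark.executor.extraClassPath={plugin_jar}",
        # Tell Spark which ShuffleManager implementation to load.
        "--conf", f"spark.shuffle.manager={shuffle_manager_class}",
        app_jar,
    ]


# Hypothetical values, for illustration only.
args = build_submit_args(
    app_jar="my-app.jar",
    shuffle_manager_class="com.example.shuffle.HypotheticalShuffleManager",
    plugin_jar="/opt/example/hypothetical-shuffle-plugin.jar",
)
print(" ".join(args))
```

Because the replacement is driven entirely by configuration, removing the three `--conf` options returns the job to the default Spark shuffle.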

OmniAdvisor

OmniAdvisor 1.0: parses parameters of historical Spark and Hive SQL tasks, uses AI algorithms to intelligently tune parameter sampling, and implements end-to-end online parameter tuning for tasks.

  • Spark 3.1.1
  • Spark 3.3.1
  • Hive 3.1.0
  • Tez 0.10.0

OmniAdvisor 2.0: samples parameters of spark-submit tasks and recommends optimal configurations through AI iterative tuning, expert rule-based tuning, migration generalization tuning, and operator acceleration, enabling end-to-end parameter tuning for Spark tasks.

  • Spark 3.3.1

OmniMV

OmniMV uses AI algorithms to recommend the optimal materialized view from historical SQL queries, automatically matches SQL statements with a materialized view in Spark, and replaces part of the SQL statements in an execution plan with the matched materialized view. This reduces repeated computation and improves query efficiency. You can submit an SQL task to a Spark cluster. The cluster management node distributes the task to multiple compute nodes as subtasks for execution.

  • Spark 3.1.1
  • Spark 3.4.3
  • Hive 3.1.0
  • ClickHouse 22.3.6.5
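The matching and replacement step can be pictured with a deliberately simplified sketch. It matches on exact SQL text, whereas the description above refers to matching within an execution plan, so treat it as an illustration of the idea only. The view name, table name, and function are hypothetical.

```python
def rewrite_with_view(sql, views):
    """Replace a parenthesized subquery that exactly matches a view definition.

    `views` maps a materialized view name to the SQL text that defines it.
    Returns the (possibly rewritten) SQL and the name of the view used, if any.
    """
    for view_name, view_sql in views.items():
        subquery = f"({view_sql})"
        if subquery in sql:
            return sql.replace(subquery, view_name), view_name
    return sql, None


# Hypothetical materialized view that precomputes a daily aggregate.
views = {
    "mv_daily_sales": "SELECT day, SUM(amount) AS total FROM sales GROUP BY day",
}
query = (
    "SELECT day, total FROM "
    "(SELECT day, SUM(amount) AS total FROM sales GROUP BY day) t "
    "WHERE total > 100"
)

rewritten, used_view = rewrite_with_view(query, views)
print(rewritten)
```

After the rewrite, the aggregation is read from the precomputed view instead of being recalculated for every query that contains the same subquery.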

OmniScheduler

OmniScheduler enhances the capacity scheduling algorithm of Hadoop YARN. It obtains the cluster load information, calculates a weight for each node from its physical resources, sorts the nodes by that weight, and preferentially schedules tasks onto low-load nodes. This improves load balancing within the cluster, resulting in balanced resource allocation and efficient resource utilization.

  • Spark 3.1.1
  • Spark 3.3.1
  • Hive 3.1.0
  • Hadoop 3.3.4
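The weight-and-sort idea can be sketched as follows. This is a minimal illustration, assuming a load score that is a weighted sum of CPU and memory usage. The actual OmniScheduler formula, its inputs, and its weights are not given in this section, so the fields and default weights below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NodeLoad:
    name: str
    cpu_used: float  # fraction of CPU in use, 0.0 to 1.0
    mem_used: float  # fraction of memory in use, 0.0 to 1.0


def load_score(node, cpu_weight=0.5, mem_weight=0.5):
    """Combine per-resource usage into one score (assumed weighted sum)."""
    return cpu_weight * node.cpu_used + mem_weight * node.mem_used


def rank_nodes(nodes, cpu_weight=0.5, mem_weight=0.5):
    """Return nodes ordered from lowest to highest load score."""
    return sorted(nodes, key=lambda n: load_score(n, cpu_weight, mem_weight))


nodes = [
    NodeLoad("node-a", cpu_used=0.90, mem_used=0.40),
    NodeLoad("node-b", cpu_used=0.20, mem_used=0.30),
    NodeLoad("node-c", cpu_used=0.50, mem_used=0.60),
]

ranked = rank_nodes(nodes)
print([n.name for n in ranked])  # the first entry is the preferred node
```

Changing the weights changes which resource dominates the ranking, for example a higher `cpu_weight` favors nodes with idle CPUs even if their memory is busier.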

OmniShield

OmniShield is a confidential computing component of the Spark big data engine. It runs in the hardware-based trusted execution environment (TEE) of the customer's data center and encrypts and decrypts data by executing the computing process inside the TEE. With OmniShield, data security in the rich execution environment (REE) is also safeguarded. In confidential computing scenarios, OmniShield provides data source encryption and decryption capabilities for DataFrame and SparkSQL applications, as well as end-to-end security protection for Spark applications based on the Arm confidential computing TEE kit.

  • Spark 3.3.1
  • Hive 3.1.0

OmniHBaseGSI

OmniHBaseGSI stores index data in an independent index table to accelerate SingleColumnValueFilter conditional queries. When a query condition hits an index, the full-table scan of the data table is converted into an exact-range query of the index table, which increases the query speed.

  • HBase 2.4.14
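The conversion from a full-table scan to a range query can be illustrated with an in-memory sketch. It assumes the index table is a sorted list of (column value, row key) pairs, which is a common secondary index layout but is not confirmed here as the OmniHBaseGSI layout. The table contents and function names are hypothetical.

```python
from bisect import bisect_left

# Hypothetical data table: row key -> column values.
data_table = {
    "row1": {"city": "Paris"},
    "row2": {"city": "Tokyo"},
    "row3": {"city": "Paris"},
    "row4": {"city": "Lima"},
}


def full_scan(table, column, value):
    """Without an index: inspect every row of the data table."""
    return sorted(key for key, row in table.items() if row.get(column) == value)


def build_index(table, column):
    """Index table: entries sorted by (column value, row key)."""
    return sorted((row[column], key) for key, row in table.items() if column in row)


def index_lookup(index, value):
    """With an index: jump to the first matching entry and read a contiguous range."""
    start = bisect_left(index, (value,))
    matches = []
    for i in range(start, len(index)):
        indexed_value, row_key = index[i]
        if indexed_value != value:
            break  # the range has ended, so no later entry can match
        matches.append(row_key)
    return matches


city_index = build_index(data_table, "city")
print(index_lookup(city_index, "Paris"))
```

Both functions return the same row keys, but the indexed lookup touches only the entries inside the matching range instead of every row.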

OmniData

OmniData pushes operators of the big data engine down to storage nodes to implement near-data computing, which reduces network bandwidth consumption and improves the performance of the query engine. OmniData supports popular data formats such as ORC and Parquet. It allows Spark to push down the Filter, Aggregation, and Limit operators to the CPUs on a storage node, so that invalid data is not transmitted over the network and big data computing performance improves.

  • Spark 3.0.0
  • Spark 3.1.1
  • Hive 3.1.0
  • openLooKeng 1.4.1
  • openLooKeng 1.6.1
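The effect of pushing Filter and Limit down to the storage side can be sketched with a small simulation. The function below stands in for a storage node and is not the OmniData API. The row layout and predicate are hypothetical, and the number of returned rows is used as a stand-in for network traffic.

```python
def storage_scan(rows, predicate=None, limit=None):
    """Simulated storage node: applies a pushed-down filter and limit, if given."""
    out = []
    for row in rows:
        if limit is not None and len(out) >= limit:
            break
        if predicate is not None and not predicate(row):
            continue
        out.append(row)
    return out


def is_large(row):
    """Hypothetical filter condition."""
    return row["amount"] >= 9900


rows = [{"id": i, "amount": i * 10} for i in range(1000)]

# Without pushdown: every row crosses the network, then the compute node filters.
transferred = storage_scan(rows)
result_no_pushdown = [r for r in transferred if is_large(r)][:5]

# With pushdown: the storage node filters and limits before sending anything.
pushed = storage_scan(rows, predicate=is_large, limit=5)

print(len(transferred), len(pushed))
```

The two approaches produce the same result set, but the pushdown path returns only the rows that the query actually needs.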

OmniStream

The OmniStream feature uses native code (C/C++) to implement Flink SQL operators to improve query performance. The Flink engine is reconstructed natively for enhanced performance.

  • Flink 1.16.3

OmniStateStore

OmniStateStore acts as a Flink state backend plugin to accelerate state storage and improve overall Flink performance.

  • Flink 1.16.1
  • Flink 1.16.3
  • Flink 1.17.1