OmniData
In a large-scale cluster (more than 200 nodes), whether big data storage and compute are decoupled or co-located, computing tasks cannot guarantee data locality. The operators of the original compute engine run on the compute nodes: all of the table, row, and column data involved is first read from the storage nodes to the compute nodes, and only then are operations such as filtering and aggregation performed. Because data locality cannot be guaranteed, a large amount of data must travel from the storage nodes to the compute nodes over the network, reducing efficiency and degrading computing performance.
The OmniRuntime operator pushdown feature, OmniData, pushes operators with low selection rates (the size of the operator's output dataset divided by the size of its input dataset) down to the storage nodes. Data is then read and processed locally on the storage nodes, and only the valid result datasets are returned to the compute nodes over the network. This improves network transmission efficiency and optimizes big data computing performance.
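The pushdown decision described above can be sketched in a few lines. This is a hypothetical illustration, not the actual OmniData API: the function names and the threshold value are assumptions for clarity.

```python
# Illustrative sketch of the pushdown decision: an operator is pushed
# down to the storage nodes only when its selection rate (output rows
# divided by input rows) falls below a configurable threshold.
# All names and the default threshold here are hypothetical.

def selection_rate(output_rows: int, input_rows: int) -> float:
    """Fraction of input rows that survive the operator."""
    return output_rows / input_rows

def should_push_down(output_rows: int, input_rows: int,
                     threshold: float = 0.5) -> bool:
    """Push the operator to storage only if it filters out most rows."""
    return selection_rate(output_rows, input_rows) < threshold

# A highly selective filter (few rows survive) is worth pushing down;
# a weak filter is not, since most of the data would cross the network anyway.
print(should_push_down(787_000, 800_000_000))  # highly selective filter
print(should_push_down(600, 1_000))            # weak filter
```

The key point is that pushdown only pays off when the operator discards most of its input, so the network carries far less data than a full table read would.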
Take a service SQL statement as an example. Its WHERE clause applies multiple filter operators to table fields.
... (
  SELECT logid AS xxx ......
  FROM xxx_log_hm
  WHERE pt_d = '20200709'
    AND optype = '11'
    AND substr(downloadTime, 1, 10) = '20200709'
) ...
Compute engines such as Spark provide CBO statistics, from which the selection rate of these filter operators can be obtained. In this example, the operator's input dataset contains 800 million rows and its output dataset contains 787,000 rows, so the selection rate is about 1‰. Pushing these filter operators down to the OmniData service on the storage nodes for execution greatly reduces the volume of data transmitted over the network, and the SQL computing time drops from 3855.67 seconds to 2084.33 seconds. You can set the selection rate threshold for OmniData and enable or disable pushdown for an individual job.
In a big data storage-compute decoupled scenario with a typical hardware configuration, a TPC-H test shows that Spark performance on 12 SQL statements improves by an average of 54% after OmniData is enabled.
In the same storage-compute decoupled scenario with a typical hardware configuration, a TPC-H test shows that Hive performance on 4 SQL statements improves by an average of 32% after OmniData is enabled.
In a customer's 200-node scenario where big data storage and compute are converged, Spark uses OmniData to perform near-data computing of the operators. Even with the OmniData service on the storage nodes restricted to 10 CPU cores, actual service performance improves by an average of 26%, and by up to 77%.