Constraints

To plan for and use the OmniOperator feature effectively, be aware of the following risks and limitations.

Common Constraints

Restrictions on the Decimal data type: The OmniOperator feature supports 64-bit and 128-bit Decimal types. If a decimal value exceeds 128 bits, an exception may be thrown or a null value returned. As a result, for aggregation operations (for example, SUM or AVG), OmniOperator's results may be inconsistent with those of the open source version of the engine. If a field may be used in AVG operations and the accumulated result could become very large, use Double or another suitable type to reduce the risk of overflow or errors.
Floating-point precision of the Double type: When using the Double type for operations such as SUM or AVG, the OmniOperator feature may produce inconsistent results due to differences in computation order. If high precision is required, use a data type with higher precision, such as Decimal.
Currently, operators like Sort, Window, and HashAgg support the spill function, while operators such as BroadcastHashJoin, ShuffledHashJoin, and SortMergeJoin do not. Select operators based on your data characteristics and processing requirements.
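
The two numeric constraints above can be illustrated with a minimal Python sketch. It does not call OmniOperator. It only shows why the order of additions can change a Double sum, and how an accumulated decimal value can leave a signed 128-bit range. The helper names (fits_in_signed_128, checked_sum), the modeling of a decimal as an unscaled integer, and the choice to return None on overflow are assumptions made for illustration.

```python
import math

# Double: floating-point addition is not associative, so the order in which
# partial sums are combined can change the result.
values = [1e16, 1.0, -1e16]
left_to_right = (values[0] + values[1]) + values[2]  # 1.0 is absorbed by 1e16
reordered = (values[0] + values[2]) + values[1]      # large terms cancel first
exact = math.fsum(values)                            # correctly rounded sum

# Decimal: model a decimal as an unscaled integer and check whether it still
# fits in a signed 128-bit range (an illustrative stand-in, not the real type).
INT128_MAX = 2**127 - 1
INT128_MIN = -(2**127)


def fits_in_signed_128(unscaled: int) -> bool:
    """Return True if the unscaled value fits in a signed 128-bit integer."""
    return INT128_MIN <= unscaled <= INT128_MAX


def checked_sum(unscaled_values):
    """Sum unscaled integers; return None once the total leaves the range."""
    total = 0
    for value in unscaled_values:
        total += value
        if not fits_in_signed_128(total):
            return None  # hypothetical: mirrors "a null value returned"
    return total


if __name__ == "__main__":
    print(left_to_right, reordered, exact)
    print(checked_sum([INT128_MAX, 1]))
```

In this sketch the same three Double values sum to 0.0 or 1.0 depending on the order, which is the kind of discrepancy to expect when two engines aggregate in different orders.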

Hive Engine Constraints

The user-defined function (UDF) plugin supports only simple UDFs. It is used to execute UDFs written based on the Hive UDF framework.

Of the 99 TPC-DS statements, Hive OmniOperator does not support q14, q72, or q89, because open source Hive may have problems when executing these statements.
When Hive OmniOperator evaluates POWER expressions, the result can differ slightly from that of open source Hive, because the C++ std::pow function and the Java Math.pow function are implemented differently. The relative error does not exceed 1e-15.
Floating-point arithmetic in Hive OmniOperator may not match open source Hive behavior. For example, when dividing by the floating-point number 0.0, open source Hive returns Null, whereas OmniOperator returns Infinity, NaN, or Null.
CBO optimization is enabled by default for the Hive engine. Hive OmniOperator requires CBO optimization to remain enabled; that is, hive.cbo.enable must not be set to false.
If SQL statements alter field attributes or use LOAD DATA to import Parquet data, use the open source TableScan operator for the Hive engine.
When Hive OmniOperator is used and the data storage structure declared for a table does not match the actual storage structure, and the GroupBy operator is consistent with the bucketing parameters, the GroupBy operator may encounter a grouping exception in open source Hive. To ensure that the declared storage structure matches the actual storage structure, either create the table without a bucketing policy or run load data local inpath to import data.
When a SUM result overflows, Hive OmniOperator may return a result that differs from that of open source Hive. OmniOperator returns null so that users can detect the overflow, whereas open source Hive returns an incorrect value, which may be misleading.
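
The POWER and floating-point division constraints above matter most when results from the two engines are compared. The following Python sketch, which does not call Hive or OmniOperator, shows a comparison with a relative tolerance of 1e-15 and contrasts two division conventions: returning null on a zero divisor, and IEEE 754 style results (Infinity or NaN). All function names are hypothetical, and the sketch does not state which result OmniOperator returns in a given case.

```python
import math


def within_relative_error(a: float, b: float, rel_tol: float = 1e-15) -> bool:
    """True if a and b differ by at most rel_tol relative to the larger magnitude."""
    return math.isclose(a, b, rel_tol=rel_tol, abs_tol=0.0)


def divide_null_on_zero(x: float, y: float):
    """Return None for a zero divisor (the behavior the text attributes to open source Hive)."""
    return None if y == 0.0 else x / y


def divide_ieee754(x: float, y: float) -> float:
    """IEEE 754 style division: x/0.0 gives +/-Infinity, 0.0/0.0 gives NaN."""
    if y != 0.0:
        return x / y
    if x == 0.0 or math.isnan(x):
        return math.nan
    # The sign of the result follows the signs of x and of the zero divisor.
    sign = math.copysign(1.0, x) * math.copysign(1.0, y)
    return math.copysign(math.inf, sign)


if __name__ == "__main__":
    print(within_relative_error(1.0, 1.0 + 2**-52))
    print(divide_null_on_zero(1.0, 0.0), divide_ieee754(1.0, 0.0), divide_ieee754(0.0, 0.0))
```

When validating query results across engines, comparing POWER outputs with a relative tolerance rather than exact equality avoids false mismatches of this size.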

Spark Engine Constraints

The adaptation layer frameworks that enable Spark to interoperate with OmniOperator are SparkExtension and Gluten.

Supported Spark versions:
- Spark 3.1.1, 3.3.1, 3.4.3, and 3.5.2
- Gluten supports only Spark 3.3.1.
OS differences:
- SparkExtension supports CentOS 7.9, openEuler 20.03, and openEuler 22.03.
- Gluten supports openEuler 22.03.

Different workloads require different memory configurations. For example, for a 3 TB TPC-DS dataset, the recommended SparkExtension configuration sets off-heap memory to at least 20 GB so that all 99 SQL statements can be executed successfully. During execution, "MEM_CAP_EXCEEDED" may be reported in logs, but the execution result is not affected. If the off-heap memory is configured too low, SQL statement execution results may be incorrect.
Spark OmniOperator supports the from_unixtime and unix_timestamp expressions, subject to the following constraints:
1. The time parsing policy spark.sql.legacy.timeParserPolicy must be EXCEPTION or CORRECTED, and cannot be LEGACY.
2. For some improper parameter values (such as non-existent dates and invalid ultra-large timestamp values), the processing results of OmniRuntime are different from those of open source Spark.
3. In the SparkExtension scenario, you can set spark.omni.sql.columnar.unixTimeFunc.enabled=false to roll back the two functions. In the Gluten scenario, you can set spark.gluten.sql.columnar.backend.omni.unixTimeFunc.enabled to roll back the two functions. After the rollback, the open source Spark implementations of the functions are used, which avoids the differences described in item 2.
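
As a minimal sketch of how the settings in the list above might be passed to a job, the following Python helper turns configuration keys into spark-submit --conf arguments. The configuration keys and values are taken from the list. The helper names (check_time_parser_policy, to_submit_args) are hypothetical, and only the SparkExtension rollback key is shown because the text gives an explicit value for it.

```python
# Values allowed by item 1 above; LEGACY is not supported.
ALLOWED_TIME_PARSER_POLICIES = {"EXCEPTION", "CORRECTED"}


def check_time_parser_policy(policy: str) -> str:
    """Return the normalized policy, or raise if it is not an allowed value."""
    normalized = policy.upper()
    if normalized not in ALLOWED_TIME_PARSER_POLICIES:
        raise ValueError(
            f"spark.sql.legacy.timeParserPolicy={policy} is not supported here"
        )
    return normalized


def to_submit_args(conf: dict) -> list:
    """Render a configuration dict as spark-submit '--conf key=value' arguments."""
    args = []
    for key, value in conf.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


# Option A: keep the OmniOperator implementations and use a supported policy.
accelerated = {
    "spark.sql.legacy.timeParserPolicy": check_time_parser_policy("CORRECTED"),
}

# Option B (SparkExtension): roll the two functions back to open source Spark.
rolled_back = {"spark.omni.sql.columnar.unixTimeFunc.enabled": "false"}

if __name__ == "__main__":
    print(" ".join(to_submit_args(accelerated)))
    print(" ".join(to_submit_args(rolled_back)))
```

Validating the policy before submission surfaces an unsupported LEGACY setting early instead of at query time.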
When Spark OmniOperator performs expression codegen on a large number of columns (for example, 500 columns) at the same time, the compilation overhead outweighs the OmniOperator acceleration effect. In this scenario, use open source Spark.
For Spark 3.4.3 and Spark 3.5.2, OmniOperator does not support decimal128 data in the CHAR or AVG function. Such data may cause operators to be rolled back during acceleration.
OmniOperator does not support ORC write for Spark 3.4.3 or Spark 3.5.2.
In Spark 3.5.2, OmniOperator supports the ROW_NUMBER and rank functions. When dense_rank is used, operators are rolled back.
Spark OmniOperator does not support comparison operators (<, <=, >, >=, !=, <>, =, ==, <=>) for Boolean data, and does not support <=> for any data type. If an unsupported operation is encountered during execution, the operator is rolled back, which is expected behavior. If a rollback occurs during a join on a large table, performance may deteriorate due to the high overhead of row-to-column conversion. Avoid such scenarios where possible to minimize the impact of rollback on performance.
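
To help recognize the <=> operator mentioned above, the following Python sketch models the difference between SQL = and the null-safe <=> comparison, using None to stand for NULL. It illustrates standard SQL semantics only and is not OmniOperator or Spark code; the function names are hypothetical.

```python
def sql_equal(a, b):
    """SQL '=': the result is NULL (None) if either operand is NULL."""
    if a is None or b is None:
        return None
    return a == b


def sql_null_safe_equal(a, b):
    """SQL '<=>': never NULL; two NULLs compare as equal."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


if __name__ == "__main__":
    for left, right in [(1, 1), (1, None), (None, None)]:
        print(left, right, sql_equal(left, right), sql_null_safe_equal(left, right))
```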
