Constraints
Understand the OmniOperator usage restrictions before configuring the feature.
- The user-defined function (UDF) plugin supports only simple UDFs. It is used to execute UDFs written based on the Hive UDF framework.
- OmniOperator supports 64-bit and 128-bit Decimal data types. If Decimal data exceeds 128 bits, an exception is thrown or Null is returned, which may not match the open source behavior of the engine. For example, during SUM or AVG aggregation, if an intermediate result exceeds 128-bit Decimal, the open source engine behaves normally, but OmniOperator throws an exception or returns Null depending on the configuration. If AVG must be calculated on a field whose accumulated result may be too large, use another storage type such as Double.
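The overflow risk above can be sketched in plain Python, assuming (as is conventional for decimal128) a limit of 38 significant digits; the constant name and helper logic below are illustrative only, not OmniOperator APIs:

```python
from decimal import Decimal, getcontext

# Assumed limit for this sketch: a 128-bit decimal holds at most 38
# significant digits. Accumulating large values during SUM/AVG can push
# the intermediate result past that limit, where the open source engine
# still succeeds but OmniOperator throws an exception or returns Null.
getcontext().prec = 60  # enough headroom to observe the overflow exactly

DECIMAL128_MAX = Decimal(10) ** 38 - 1  # largest 38-digit unscaled value

values = [DECIMAL128_MAX] * 10
total = sum(values, Decimal(0))

# The intermediate SUM no longer fits in decimal128:
assert total > DECIMAL128_MAX
```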
- Different workloads require different memory configurations. For example, for a 3 TB TPC-DS dataset, the recommended SparkExtension configuration requires at least 20 GB of off-heap memory so that all 99 SQL statements can be executed successfully. During execution, "MEM_CAP_EXCEEDED" may be reported in the logs, but the execution result is not affected. If the off-heap memory is configured too low, SQL statement execution results may be incorrect.
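As an illustration, off-heap memory can be granted with the standard Spark properties below; the 20 GB value reflects the 3 TB TPC-DS recommendation above, and the submit command is a sketch, not a complete invocation:

```shell
# Illustrative spark-submit fragment: enable off-heap memory and size it
# to at least 20 GB for the 3 TB TPC-DS workload described above.
spark-submit \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=20g \
  ...
```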
- The spill function is available for the Sort, Window, and HashAgg operators but not for the BroadcastHashJoin, ShuffledHashJoin, and SortMergeJoin operators.
- When executing the 99 TPC-DS statements, Hive OmniOperator does not support q14, q72, or q89, because open source Hive itself may have problems executing these three queries.
- When Hive OmniOperator evaluates POWER expressions, there is a slight implementation difference between the C++ std::pow function and the Java Math.pow function. As a result, the POWER expression implemented in C++ may differ from the open source Hive POWER expression, but the relative precision error is not greater than 1e-15.
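The guarantee above can be expressed as a relative-error check. This is a minimal sketch: the helper name is ours, and the two results below both come from the platform's pow (the real comparison would be C++ std::pow versus Java Math.pow):

```python
import math

def within_relative_error(a: float, b: float, tol: float = 1e-15) -> bool:
    # True if a and b agree to within the stated relative tolerance.
    return abs(a - b) <= tol * abs(b)

# Stand-ins for the two implementations compared in the constraint above:
result_a = math.pow(2.5, 10)
result_b = 2.5 ** 10

assert within_relative_error(result_a, result_b)
```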
- Spark OmniOperator supports the from_unixtime and unix_timestamp expressions, with the following constraints:
  - The time parsing policy spark.sql.legacy.timeParserPolicy must be EXCEPTION or CORRECTED; it cannot be LEGACY.
  - For some improper parameter values (such as non-existent dates and invalid, excessively large timestamp values), the processing results of OmniRuntime differ from those of open source Spark.
  - You can set spark.omni.sql.columnar.unixTimeFunc.enabled to false to roll back these two functions, that is, use the open source functions to avoid the inconsistency described above.
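Putting the two settings above together, an illustrative configuration fragment (property names taken from the constraints above; the spark-sql command itself is only an example) looks like this:

```shell
# Keep the time parser policy non-LEGACY, and disable the columnar
# unix-time functions if exact open source behavior is required.
spark-sql \
  --conf spark.sql.legacy.timeParserPolicy=CORRECTED \
  --conf spark.omni.sql.columnar.unixTimeFunc.enabled=false
```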
- When Hive OmniOperator performs floating-point arithmetic, the results may not match open source Hive behavior. For example, when dividing a floating-point number by 0.0, open source Hive returns Null, whereas OmniOperator returns Infinity, NaN, or Null.
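The two behaviors can be sketched side by side. The helper names are ours, and the IEEE variant is a simplification (it assumes a positive zero divisor); the point is only that a C++ engine following IEEE 754 yields Infinity or NaN where open source Hive yields Null:

```python
import math

def hive_divide(a: float, b: float):
    # Open source Hive semantics (sketch): division by zero yields Null.
    return None if b == 0.0 else a / b

def ieee_divide(a: float, b: float):
    # IEEE 754 semantics (sketch), as used by the C++ implementation:
    if b == 0.0:
        if a == 0.0:
            return math.nan                 # 0.0 / 0.0 -> NaN
        return math.copysign(math.inf, a)   # +-x / +0.0 -> +-Infinity
    return a / b

assert hive_divide(1.0, 0.0) is None
assert ieee_divide(1.0, 0.0) == math.inf
assert math.isnan(ieee_divide(0.0, 0.0))
```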
- CBO optimization is enabled by default for the Hive engine. Hive OmniOperator requires CBO optimization to be enabled; that is, hive.cbo.enable cannot be set to false.
- If SQL statements contain the Alter field attribute or use LOAD DATA to import .parq data, the open source TableScan operator is recommended for the Hive engine.
- When Spark OmniOperator performs expression codegen on a large number of columns (for example, 500 columns) at the same time, the compilation overhead outweighs the OmniOperator acceleration benefit. In this scenario, you are advised to use open source Spark.
- For Spark 3.4.3 and Spark 3.5.2, OmniOperator does not support decimal128 data in CHAR columns or in the AVG function. Such data may cause operator rollback during operation acceleration.
- OmniOperator does not support ORC write for Spark 3.4.3 or Spark 3.5.2.
- OmniOperator supports the row_number and rank window functions in Spark 3.5.2. In the dense_rank scenario, the operators are rolled back.
- Due to floating-point precision and differing execution orders, OmniOperator may produce different SUM and AVG results on the Double type. If you need an exact result, consider using a higher-precision data type, such as Decimal.
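The order sensitivity above is a property of floating-point addition itself, which is not associative; a small self-contained sketch (independent of Spark or Hive):

```python
# Floating-point addition is not associative, so an engine that
# accumulates Double values in a different order can round to a
# different SUM/AVG result:
left_to_right = (1e16 + 1.0) - 1e16   # the 1.0 is absorbed by 1e16: 0.0
reassociated  = (1e16 - 1e16) + 1.0   # 1.0
assert left_to_right != reassociated

# A higher-precision type such as Decimal avoids the discrepancy:
from decimal import Decimal
exact = (Decimal("1e16") + 1) - Decimal("1e16")
assert exact == 1
```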
- Spark OmniOperator does not support comparison operators (<, <=, >, >=, !=, <>, =, ==, <=>) for Boolean data, and does not support <=> for any data type. If an incompatible operation occurs during execution, the resulting operator rollback is expected behavior. If a rollback occurs during a join on a large table, performance may deteriorate due to the high overhead of row-to-column conversion. Avoid such scenarios where possible to minimize the performance impact of rollback.
- When the storage structure declared in a Hive OmniOperator table does not match the actual storage structure, and the GroupBy operator's grouping matches the bucketing parameters, the GroupBy operator may encounter a grouping exception, as it would in open source Hive. To ensure that the declared storage structure matches the actual one, create the table without a bucketing policy or run load data local inpath to import data.
- When a SUM result overflows, Hive OmniOperator may produce a result different from the open source engine's behavior. OmniOperator returns Null so that users can perceive the overflow, whereas open source Hive returns an incorrect value, which may be misleading.
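The contrast can be sketched as two summation strategies over 64-bit integers. Both helpers are illustrative, not OmniOperator APIs, and the assumption that the open source engine's wrong value wraps in two's complement is part of this sketch:

```python
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1

def wrapping_sum(values):
    # Sketch of the open source behavior: 64-bit overflow silently
    # wraps (two's complement), producing a wrong but non-Null value.
    total = 0
    for v in values:
        total = (total + v - INT64_MIN) % 2**64 + INT64_MIN
    return total

def checked_sum(values):
    # Sketch of the OmniOperator behavior: overflow surfaces as Null
    # (None) so the user can perceive it.
    total = sum(values)
    return None if not (INT64_MIN <= total <= INT64_MAX) else total

data = [INT64_MAX, 1]
assert wrapping_sum(data) == INT64_MIN   # silently wrong value
assert checked_sum(data) is None         # Null signals the overflow
```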
Parent topic: Feature Overview