Query Performance Degraded When Executing a Cast String to Double Expression Containing a Large Number of Columns (for Example, 500 Columns) in Spark SQL

Symptom

When Spark OmniOperator performs an aggregation query on a large number of variable-length columns, for example, 500 columns, the SQL query performance may deteriorate.

Key Process and Cause Analysis

When performing aggregation operations on variable-length type columns, Codegen needs to convert character strings to double-precision floating-point numbers. Codegen itself has compilation overhead. The SQL query performance is affected by both the compile time and operator execution time. When a large number of columns trigger Codegen compilation at the same time, the compilation overhead can actually take longer than running the OmniOperator itself. As a result, the overall SQL query performance is poor.

Conclusion and Solution

If SQL query is to be performed on a large number of columns that involve expression processing and Codegen compilation, you are advised to roll back to the open-source Spark for query, which does not affect the consistency of task results.

Parent topic: Troubleshooting