SQL Query Performance Deteriorated When Spark Executes SQL Statements Containing the cast string to double Expression of over 500 Columns

Symptom

When Spark OmniOperator performs an aggregation query on a large number of varchar columns, for example, 500 columns, the SQL query performance is poor.

Key Process and Cause Analysis

Aggregation operations on varchar columns involve the cast string to double expression processing, which is implemented by Codegen. Codegen itself has compilation overhead. The SQL query performance is reflected by both the compilation overhead and operator execution overhead. When Codegen compilation is required by a large number of columns at the same time, the compilation overhead is much greater than the OmniOperator operator execution time (overhead). As a result, the overall SQL query performance is poor.

Conclusion and Solution

If SQL query is to be performed on a large number of columns that involve expression processing and Codegen compilation, you are advised to roll back to the native Spark for query, which does not affect the consistency of task results.

Parent topic: Troubleshooting