Using in a Hadoop Security Cluster
In secure mode, you can adjust the Spark Extension plugin startup parameters for ORC and Parquet datasets to optimize performance.
- Scenario 1: In secure mode, the dataset format is ORC and native ORC is half disabled. Add the following parameters to the Spark Extension plugin startup command provided in Executing Spark Services:
--conf spark.sql.codegen.wholeStage=true --conf spark.omni.sql.columnar.nativefilescan=true --conf spark.omni.sql.columnar.orcNativefilescan=false
- When native ORC is half disabled, data is read in the native ORC format but OmniOperator converts the native data structure into OmniVector.
- A TPC-DS 99 performance test on a 3 TB dataset shows that, with native ORC half disabled, performance in secure mode is on average about 10% lower than with native ORC in non-secure mode.
- Scenario 2: In secure mode, the dataset format is ORC and native ORC is fully disabled, that is, data is processed in ORC table scan mode. Add the following parameters to the Spark Extension plugin startup command provided in Executing Spark Services:
--conf spark.sql.codegen.wholeStage=false --conf spark.omni.sql.columnar.nativefilescan=false --conf spark.omni.sql.columnar.orcNativefilescan=false
A TPC-DS 99 performance test on a 3 TB dataset shows that, with native ORC fully disabled, performance in secure mode is on average about 17% lower than with native ORC in non-secure mode.
- Scenario 3: In secure mode, the dataset format is Parquet and native Parquet is used by default. Add the following parameters to the Spark Extension plugin startup command provided in Executing Spark Services:
--conf spark.omni.sql.columnar.nativefilescan=true
Compared with native ORC in non-secure mode, native Parquet in secure mode has an average performance loss of approximately 5%.
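The flag sets above are appended to the plugin startup command like any other Spark configuration. A minimal sketch for Scenario 1, assuming a spark-sql launch; the base command, jar path, and any other options are placeholders for the actual startup command given in Executing Spark Services:

```shell
# Sketch only: take the existing Spark Extension plugin startup command
# from "Executing Spark Services" and append the scenario's flags.
# The jar path below is a hypothetical placeholder.
spark-sql \
  --jars /path/to/boostkit-omniop-spark-extension.jar \
  --conf spark.sql.codegen.wholeStage=true \
  --conf spark.omni.sql.columnar.nativefilescan=true \
  --conf spark.omni.sql.columnar.orcNativefilescan=false
```

For Scenarios 2 and 3, keep the same base command and swap in the corresponding flag set listed above.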
Parent topic: Using on Spark