Executing Spark Services
This section describes how to run Spark 3.1.1 services with operator pushdown using the spark-sql command. A non-partitioned TPC-H table with 1 TB of data is used as the test table, and the test SQL statement is tpch-sql6.
Table 1 lists the related table information.
| Table Name | Format | Total Number of Rows | Occupied Space |
|---|---|---|---|
| lineitem | orc | 5999989709 | 169.6 GB |
Operations on Spark 3.1.1
INFO-level logs are not printed on Spark 3.1.1 (on Yarn) by default, so log redirection is required.
- Define the log file log4j.properties in /usr/local/spark/conf:
  - Create the log file log4j.properties:

    ```shell
    cd /usr/local/spark/conf
    vi log4j.properties
    ```

  - Press i to enter the insert mode and add the following content to the file:

    ```properties
    log4j.rootCategory=INFO, FILE
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.err
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
    log4j.logger.org.apache.spark.sql.execution=DEBUG
    log4j.logger.org.apache.spark.repl.Main=INFO
    log4j.appender.FILE=org.apache.log4j.FileAppender
    log4j.appender.FILE.file=/usr/local/spark/logs/file.log
    log4j.appender.FILE.layout=org.apache.log4j.PatternLayout
    log4j.appender.FILE.layout.ConversionPattern=%m%n
    ```

  - Press Esc, type :wq!, and press Enter to save the file and exit.
- Change the value of log4j.appender.FILE.file in the log4j.properties file to a custom directory and file name.
- Add --driver-java-options -Dlog4j.configuration=file:/usr/local/spark/conf/log4j.properties when running the spark-sql command, for example:

  ```shell
  /usr/local/spark/bin/spark-sql \
    --driver-class-path '/opt/boostkit/*' \
    --jars '/opt/boostkit/*' \
    --conf 'spark.executor.extraClassPath=./*' \
    --name tpch_query6.sql \
    --driver-memory 50G \
    --driver-java-options -Dlog4j.configuration=file:/usr/local/spark/conf/log4j.properties \
    --executor-memory 32G \
    --num-executors 30 \
    --executor-cores 18
  ```
- Run the sql6 statement:

  ```sql
  select
    sum(l_extendedprice * l_discount) as revenue
  from
    tpch_flat_orc_1000.lineitem
  where
    l_shipdate >= '1993-01-01'
    and l_shipdate < '1994-01-01'
    and l_discount between 0.06 - 0.01 and 0.06 + 0.01
    and l_quantity < 25;
  ```
When the task is executed, the pushdown information is printed as follows:

```
Selectivity: 0.014160436451808448
Push down with [PushDownInfo(ListBuffer(FilterExeInfo((((((((isnotnull(l_shipdate#18) AND isnotnull(l_discount#14)) AND isnotnull(l_quantity#12)) AND (l_shipdate#18 >= 8401)) AND (l_shipdate#18 < 8766)) AND (l_discount#14 >= 0.05)) AND (l_discount#14 <= 0.07)) AND (l_quantity#12 < 25.0)),List(l_quantity#12, l_extendedprice#13, l_discount#14, l_shipdate#18))),ListBuffer(AggExeInfo(List(sum((l_extendedprice#13 * l_discount#14))),List(),List(sum#43))),None,Map(server1 -> server1, agent2 -> agent2, agent1 -> agent1))]
```
The output contains the pushdown selectivity and operator information.
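The selectivity value can be pulled out of the redirected log with standard shell tools. A minimal sketch, assuming the `Selectivity: <value>` line format shown in the sample output above; the sample line written below stands in for a real log file (in practice the file is /usr/local/spark/logs/file.log, as configured in log4j.properties):

```shell
# Sample log line standing in for the real pushdown log (illustration only).
LOG=/tmp/file.log   # in practice: /usr/local/spark/logs/file.log
echo 'Selectivity: 0.014160436451808448' > "$LOG"

# Take the first Selectivity line from the log and keep only the numeric value.
selectivity=$(grep -m1 '^Selectivity:' "$LOG" | awk '{print $2}')
echo "pushdown selectivity: $selectivity"
```

A low selectivity such as this one indicates that the pushed-down filter discards most rows before they reach Spark.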

Table 2 lists the parameters in the spark-sql command.

Table 2 OmniData parameters

| Parameter | Recommended Value | Description |
|---|---|---|
| --driver-memory | 50G | Available memory of the driver. |
| --executor-memory | 32G | Available memory of the executor. |
| --num-executors | 30 | Number of started executors. The default value is 2. |
| --executor-cores | 18 | Number of CPU cores used by each executor. The default value is 1. |
| --driver-class-path | '/opt/boostkit/*' | Path of the extra JAR packages to be transferred to the driver. |
| --jars | '/opt/boostkit/*' | JAR packages to be contained in the driver and executor class paths. |
| --conf | 'spark.executor.extraClassPath=./*' | Sets Spark parameters. |
| --driver-java-options -Dlog4j.configuration | file:/usr/local/spark/conf/log4j.properties | Log4j log configuration file path. |
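As a quick sanity check on the recommended values, the executor settings determine the total resources that Yarn must be able to allocate. A small sketch, using plain arithmetic over the Table 2 values (the totals are not part of the original command):

```shell
# Recommended values from Table 2.
NUM_EXECUTORS=30
EXECUTOR_MEMORY_GB=32
EXECUTOR_CORES=18

# Total resources requested across all executors (driver memory is extra).
echo "total executor memory: $((NUM_EXECUTORS * EXECUTOR_MEMORY_GB)) GB"
echo "total executor cores: $((NUM_EXECUTORS * EXECUTOR_CORES))"
```

With the recommended values this requests 960 GB of executor memory and 540 cores in total, plus the 50 GB driver memory; scale --num-executors down if the cluster cannot provide this.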