
Executing Spark Services

Use the spark-sql command to run queries with operator pushdown on Spark.

In this example, a non-partitioned TPC-H table with 1 TB of data is used as the test table. The test SQL statement is tpch-sql6.

Table 1 lists the related table information.

Table 1 Table information

| Table Name | Format | Total Number of Rows | Occupied Space |
| lineitem   | orc    | 5999989709           | 169.6 GB       |
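As a rough sanity check on the statistics in Table 1, the average on-disk size per row of the ORC-compressed table can be estimated from the row count and occupied space (a sketch, assuming "GB" here means GiB):

```python
# Average on-disk bytes per row of the lineitem table from Table 1.
# Assumption: "169.6 GB" is interpreted as GiB (1024**3 bytes).
total_rows = 5_999_989_709
occupied_bytes = 169.6 * 1024**3
avg_bytes_per_row = occupied_bytes / total_rows
print(f"~{avg_bytes_per_row:.1f} bytes per row")  # roughly 30 bytes per row
```

The small per-row footprint reflects ORC's columnar compression; the uncompressed 1 TB data set occupies only about 170 GB on disk.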

Operations on Spark 3.1.1

INFO-level logs are not printed by default on Spark 3.1.1 (on YARN). Therefore, log redirection is required for Spark 3.1.1.

  1. Create the log configuration file log4j.properties in /usr/local/spark/conf.
    log4j.rootCategory=INFO, FILE
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.err
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
     
    log4j.logger.org.apache.spark.sql.execution=DEBUG
    log4j.logger.org.apache.spark.repl.Main=INFO
     
    log4j.appender.FILE=org.apache.log4j.FileAppender
    log4j.appender.FILE.file=/usr/local/spark/logs/file.log
    log4j.appender.FILE.layout=org.apache.log4j.PatternLayout
     
    log4j.appender.FILE.layout.ConversionPattern=%m%n
  2. Change the value of log4j.appender.FILE.file in the log4j.properties file to a custom log directory and file name as required.
  3. Add --driver-java-options -Dlog4j.configuration=file:/usr/local/spark/conf/log4j.properties when running the spark-sql command, for example:
    /usr/local/spark/bin/spark-sql \
      --driver-class-path '/opt/boostkit/*' \
      --jars '/opt/boostkit/*' \
      --conf 'spark.executor.extraClassPath=./*' \
      --name tpch_query6.sql \
      --driver-memory 50G \
      --driver-java-options -Dlog4j.configuration=file:/usr/local/spark/conf/log4j.properties \
      --executor-memory 32G \
      --num-executors 30 \
      --executor-cores 18
  4. Run the tpch-sql6 statement.
    select
        sum(l_extendedprice * l_discount) as revenue
    from
        tpch_flat_orc_1000.lineitem
    where
        l_shipdate >= '1993-01-01'
        and l_shipdate < '1994-01-01'
        and l_discount between 0.06 - 0.01 and 0.06 + 0.01
        and l_quantity < 25;

    When the task is executed, the pushdown information is printed as follows:

    Selectivity: 0.014160436451808448
    Push down with [PushDownInfo(ListBuffer(FilterExeInfo((((((((isnotnull(l_shipdate#18) AND isnotnull(l_discount#14)) AND isnotnull(l_quantity#12)) AND (l_shipdate#18 >= 8401)) AND (l_shipdate#18 < 8766)) AND (l_discount#14 >= 0.05)) AND (l_discount#14 <= 0.07)) AND (l_quantity#12 < 25.0)),List(l_quantity#12, l_extendedprice#13, l_discount#14, l_shipdate#18))),ListBuffer(AggExeInfo(List(sum((l_extendedprice#13 * l_discount#14))),List(),List(sum#43))),None,Map(server1 -> server1, agent2 -> agent2, agent1 -> agent1))]

    The output contains the pushdown selectivity and operator information.
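In the pushed-down filter, the l_shipdate predicates appear as integer literals (8401 and 8766) rather than date strings because Spark represents DATE values internally as the number of days since the Unix epoch (1970-01-01). A quick check of this encoding:

```python
from datetime import date

# Spark encodes DATE columns as days since the Unix epoch (1970-01-01),
# which is why the pushed-down predicate shows integer literals
# instead of the '1993-01-01' and '1994-01-01' strings from the query.
epoch = date(1970, 1, 1)
lower = (date(1993, 1, 1) - epoch).days
upper = (date(1994, 1, 1) - epoch).days
print(lower, upper)  # 8401 8766
```

These match the (l_shipdate#18 >= 8401) and (l_shipdate#18 < 8766) conditions in the pushdown log above.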

    Table 2 lists the parameters in the spark-sql command.

    Table 2 OmniData parameters

    | Parameter              | Recommended Value                                                  | Description                                                             |
    | --driver-memory        | 50G                                                                | Available memory of the driver                                          |
    | --executor-memory      | 32G                                                                | Available memory of each executor                                       |
    | --num-executors        | 30                                                                 | Number of executors to start. The default value is 2.                   |
    | --executor-cores       | 18                                                                 | Number of CPU cores used by each executor. The default value is 1.      |
    | --driver-class-path    | '/opt/boostkit/*'                                                  | Path of the extra JAR packages to be passed to the driver               |
    | --jars                 | '/opt/boostkit/*'                                                  | JAR packages to be included in the driver and executor class paths      |
    | --conf                 | 'spark.executor.extraClassPath=./*'                                | Sets a Spark configuration property                                     |
    | --driver-java-options  | -Dlog4j.configuration=file:/usr/local/spark/conf/log4j.properties  | Path of the Log4j log configuration file                                |