
Executing Spark Engine Services

Use the spark-sql command to run queries with operator pushdown on Spark.

In this example, a non-partitioned TPC-H table with 1 TB of data is used as the test table. The test SQL statement is TPC-H Q6 (tpch-sql6).

Table 1 lists the related table information.

Table 1 Table information

Table Name | Format | Total Rows | Occupied Space
-----------|--------|------------|---------------
lineitem   | orc    | 5999989709 | 169.6 GB
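
As a rough sanity check on the row count in Table 1: TPC-H sizes lineitem at roughly 6,000,000 rows per scale factor, so a 1 TB (SF = 1000) data set should hold close to 6 billion rows. A minimal sketch of this rule of thumb (the 6,000,000-rows-per-SF figure is an approximation, not taken from this document):

```shell
# TPC-H rule of thumb (assumption): lineitem holds about SF * 6,000,000 rows.
SF=1000
ESTIMATE=$(( SF * 6000000 ))
echo "$ESTIMATE"   # close to the 5999989709 rows reported in Table 1
```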

Operations on Spark 3.0.0

  1. Run the spark-sql command.

    spark-sql --num-executors 10 --executor-cores 6 --driver-class-path "/usr/local/spark-plugin-jar/*" --jars "/usr/local/spark-plugin-jar/*" --conf "spark.executor.extraClassPath=./*"


  2. Run the sql6 statement.
    select
    sum(l_extendedprice * l_discount) as revenue
    from
    tpch_flat_orc_1000.lineitem
    where
    l_shipdate >= '1993-01-01'
    and l_shipdate < '1994-01-01'
    and l_discount between 0.06 - 0.01 and 0.06 + 0.01
    and l_quantity < 25;

    When the task is executed, the pushdown information is printed as follows:

    ndp.NdpPushDown: Selectivity: 0.09795918367346941
    ndp.NdpPushDown: Push down with [PushDownInfo(ListBuffer(FilterExeInfo((((((((isnotnull(l_quantity#5) AND isnotnull(l_discount#7)) AND isnotnull(l_shipdate#11)) AND (cast(l_shipdate#11 as date) >= 8401)) AND (cast(l_shipdate#11 as date) < 8766)) AND (l_discount#7 >= 0.05)) AND (l_discount#7 <= 0.07)) AND (l_quantity#5 < 25.0)),List(l_quantity#5, l_extendedprice#6, l_discount#7, l_shipdate#11))),ListBuffer(AggExeInfo(List(sum((l_extendedprice#6 * l_discount#7))),List(),List(sum#36))),None,Map(ceph1 -> ceph1, ceph2 -> ceph2, ceph3 -> ceph3))]

    The output contains the pushdown selectivity and operator information.

    Table 2 lists the parameters in the spark-sql command.

    Table 2 Operator pushdown parameters

    Parameter           | Recommended Value                   | Description
    --------------------|-------------------------------------|------------
    --num-executors     | 10                                  | Number of executors to start. The default value is 2.
    --executor-cores    | 6                                   | Number of CPU cores used by each executor. The default value is 1.
    --driver-class-path | "/usr/local/spark-plugin-jar/*"     | Extra class path entries to pass to the driver.
    --jars              | "/usr/local/spark-plugin-jar/*"     | JAR packages to include on both the driver and executor class paths.
    --conf              | "spark.executor.extraClassPath=./*" | Sets an arbitrary Spark configuration property.
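
Two details of the printed filter are worth decoding. The discount bounds 0.06 - 0.01 and 0.06 + 0.01 from the SQL are folded into the constants 0.05 and 0.07, and the date literals appear as the integers 8401 and 8766 because Spark's DATE type counts days since the Unix epoch (1970-01-01): 1993-01-01 is day 8401 and 1994-01-01 is day 8766. A quick way to confirm the conversion (assumes GNU date, as on a typical Linux cluster node):

```shell
# Spark stores DATE values as days since 1970-01-01 (UTC), so the pushdown
# log's "cast(l_shipdate as date) >= 8401" is l_shipdate >= '1993-01-01'.
lower=$(( $(date -u -d '1993-01-01' +%s) / 86400 ))
upper=$(( $(date -u -d '1994-01-01' +%s) / 86400 ))
echo "$lower $upper"   # 8401 8766
```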

Operations on Spark 3.1.1

On Spark 3.1.1 with YARN, INFO-level logs are not printed to the console, so log redirection is required for Spark 3.1.1.

  1. Define the log configuration file log4j.properties.
    log4j.rootCategory=INFO, FILE
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.err
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
     
    log4j.logger.org.apache.spark.sql.execution=DEBUG
    log4j.logger.org.apache.spark.repl.Main=INFO
     
    log4j.appender.FILE=org.apache.log4j.FileAppender
    log4j.appender.FILE.file=/logs/file.log
    log4j.appender.FILE.layout=org.apache.log4j.PatternLayout
     
    log4j.appender.FILE.layout.ConversionPattern=%m%n
  2. In the log4j.properties file, set log4j.appender.FILE.file to your own directory and file name.
  3. Add --driver-java-options -Dlog4j.configuration=file:../conf/log4j.properties when running the spark-sql command, for example:
    /usr/local/spark/bin/spark-sql  --driver-class-path '/usr/local/spark-plugin-jar/*' --jars '/usr/local/spark-plugin-jar/*' --conf 'spark.executor.extraClassPath=./*' --name tpch_query6.sql --driver-memory 50G --driver-java-options -Dlog4j.configuration=file:../conf/log4j.properties --executor-memory 32G --num-executors 30 --executor-cores 18
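
After the job finishes, the redirected log can be searched for the same pushdown messages shown in the Spark 3.0.0 section above. A sketch, assuming the log path from the sample log4j.properties (set LOG_FILE to whatever you configured in log4j.appender.FILE.file):

```shell
# Confirm the pushdown plugin took effect by searching the redirected log.
LOG_FILE="${LOG_FILE:-/logs/file.log}"
if grep -q 'ndp.NdpPushDown' "$LOG_FILE" 2>/dev/null; then
  echo "pushdown confirmed in $LOG_FILE"
else
  echo "no pushdown messages found in $LOG_FILE"
fi
```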