
Executing Spark Engine Services

Use the spark-sql command to run queries with operator pushdown on Spark.

In this example, a non-partitioned TPC-H table with 1 TB of data is used as the test table. The test SQL statement is TPC-H Q6 (tpch-sql6).

Table 1 lists the related table information.

Table 1 Table information

Table Name | Format | Total Rows | Occupied Space
-----------|--------|------------|---------------
lineitem   | orc    | 5999989709 | 169.6 GB
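
As a rough sanity check on the row count in Table 1: TPC-H sizes lineitem at roughly 6,000,000 rows per scale factor, so a 1 TB (SF = 1000) data set should hold close to 6 billion rows. A minimal sketch of this rule of thumb (the 6,000,000-rows-per-SF figure is an approximation, not taken from this document):

```shell
# TPC-H rule of thumb (assumption): lineitem holds about SF * 6,000,000 rows.
SF=1000
ESTIMATE=$(( SF * 6000000 ))
echo "$ESTIMATE"   # close to the 5999989709 rows reported in Table 1
```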

Operations on Spark 3.0.0

  1. Run the spark-sql command.

    spark-sql --num-executors 10 --executor-cores 6 --driver-class-path "/usr/local/spark-plugin-jar/*" --jars "/usr/local/spark-plugin-jar/*" --conf "spark.executor.extraClassPath=./*"


  2. Run the sql6 statement.
    select
    sum(l_extendedprice * l_discount) as revenue
    from
    tpch_flat_orc_1000.lineitem
    where
    l_shipdate >= '1993-01-01'
    and l_shipdate < '1994-01-01'
    and l_discount between 0.06 - 0.01 and 0.06 + 0.01
    and l_quantity < 25;

    When the task is executed, the pushdown information is printed as follows:

    ndp.NdpPushDown: Selectivity: 0.09795918367346941
    ndp.NdpPushDown: Push down with [PushDownInfo(ListBuffer(FilterExeInfo((((((((isnotnull(l_quantity#5) AND isnotnull(l_discount#7)) AND isnotnull(l_shipdate#11)) AND (cast(l_shipdate#11 as date) >= 8401)) AND (cast(l_shipdate#11 as date) < 8766)) AND (l_discount#7 >= 0.05)) AND (l_discount#7 <= 0.07)) AND (l_quantity#5 < 25.0)),List(l_quantity#5, l_extendedprice#6, l_discount#7, l_shipdate#11))),ListBuffer(AggExeInfo(List(sum((l_extendedprice#6 * l_discount#7))),List(),List(sum#36))),None,Map(ceph1 -> ceph1, ceph2 -> ceph2, ceph3 -> ceph3))]

    The output contains the pushdown selectivity and operator information.

    Table 2 lists the parameters in the spark-sql command.

    Table 2 Operator pushdown parameters

    Parameter           | Recommended Value                   | Description
    --------------------|-------------------------------------|------------
    --num-executors     | 10                                  | Number of executors to start. The default value is 2.
    --executor-cores    | 6                                   | Number of CPU cores used by each executor. The default value is 1.
    --driver-class-path | "/usr/local/spark-plugin-jar/*"     | Extra class path entries to pass to the driver.
    --jars              | "/usr/local/spark-plugin-jar/*"     | JAR packages to include on both the driver and executor class paths.
    --conf              | "spark.executor.extraClassPath=./*" | Sets an arbitrary Spark configuration property.
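
Two details of the printed filter are worth decoding. The discount bounds 0.06 - 0.01 and 0.06 + 0.01 from the SQL are folded into the constants 0.05 and 0.07, and the date literals appear as the integers 8401 and 8766 because Spark's DATE type counts days since the Unix epoch (1970-01-01): 1993-01-01 is day 8401 and 1994-01-01 is day 8766. A quick way to confirm the conversion (assumes GNU date, as on a typical Linux cluster node):

```shell
# Spark stores DATE values as days since 1970-01-01 (UTC), so the pushdown
# log's "cast(l_shipdate as date) >= 8401" is l_shipdate >= '1993-01-01'.
lower=$(( $(date -u -d '1993-01-01' +%s) / 86400 ))
upper=$(( $(date -u -d '1994-01-01' +%s) / 86400 ))
echo "$lower $upper"   # 8401 8766
```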

Operations on Spark 3.1.1

On Spark 3.1.1 with YARN, INFO-level logs are not printed to the console, so log redirection is required for Spark 3.1.1.

  1. Define the log configuration file log4j.properties.
    log4j.rootCategory=INFO, FILE
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.err
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
     
    log4j.logger.org.apache.spark.sql.execution=DEBUG
    log4j.logger.org.apache.spark.repl.Main=INFO
     
    log4j.appender.FILE=org.apache.log4j.FileAppender
    log4j.appender.FILE.file=/logs/file.log
    log4j.appender.FILE.layout=org.apache.log4j.PatternLayout
     
    log4j.appender.FILE.layout.ConversionPattern=%m%n
  2. In the log4j.properties file, set log4j.appender.FILE.file to your own directory and file name.
  3. Add --driver-java-options -Dlog4j.configuration=file:../conf/log4j.properties when running the spark-sql command, for example:
    /usr/local/spark/bin/spark-sql  --driver-class-path '/usr/local/spark-plugin-jar/*' --jars '/usr/local/spark-plugin-jar/*' --conf 'spark.executor.extraClassPath=./*' --name tpch_query6.sql --driver-memory 50G --driver-java-options -Dlog4j.configuration=file:../conf/log4j.properties --executor-memory 32G --num-executors 30 --executor-cores 18
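
After the job finishes, the redirected log can be searched for the same pushdown messages shown in the Spark 3.0.0 section above. A sketch, assuming the log path from the sample log4j.properties (set LOG_FILE to whatever you configured in log4j.appender.FILE.file):

```shell
# Confirm the pushdown plugin took effect by searching the redirected log.
LOG_FILE="${LOG_FILE:-/logs/file.log}"
if grep -q 'ndp.NdpPushDown' "$LOG_FILE" 2>/dev/null; then
  echo "pushdown confirmed in $LOG_FILE"
else
  echo "no pushdown messages found in $LOG_FILE"
fi
```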