
Executing Spark Services

Use the spark-sql command to run queries with operator pushdown on Spark.

In this example, a non-partitioned TPC-H table with 1 TB of data is used as the test table. The test SQL statement is tpch-sql6.

Table 1 lists the related table information.

Table 1 Table information

| Table Name | Format | Total Number of Rows | Occupied Space |
| lineitem   | orc    | 5999989709           | 169.6 GB       |
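As a rough sanity check on the statistics in Table 1, the average on-disk size per row of the ORC-compressed table can be estimated from the row count and occupied space (a sketch, assuming "GB" here means GiB):

```python
# Average on-disk bytes per row of the lineitem table from Table 1.
# Assumption: "169.6 GB" is interpreted as GiB (1024**3 bytes).
total_rows = 5_999_989_709
occupied_bytes = 169.6 * 1024**3
avg_bytes_per_row = occupied_bytes / total_rows
print(f"~{avg_bytes_per_row:.1f} bytes per row")  # roughly 30 bytes per row
```

The small per-row footprint reflects ORC's columnar compression; the uncompressed 1 TB data set occupies only about 170 GB on disk.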

Operations on Spark 3.1.1

INFO-level logs are not printed by default on Spark 3.1.1 (on YARN). Therefore, log redirection is required for Spark 3.1.1.

  1. Create the log configuration file log4j.properties in /usr/local/spark/conf.
    log4j.rootCategory=INFO, FILE
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.target=System.err
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
     
    log4j.logger.org.apache.spark.sql.execution=DEBUG
    log4j.logger.org.apache.spark.repl.Main=INFO
     
    log4j.appender.FILE=org.apache.log4j.FileAppender
    log4j.appender.FILE.file=/usr/local/spark/logs/file.log
    log4j.appender.FILE.layout=org.apache.log4j.PatternLayout
     
    log4j.appender.FILE.layout.ConversionPattern=%m%n
  2. Change the value of log4j.appender.FILE.file in the log4j.properties file to a custom log directory and file name as required.
  3. Add --driver-java-options -Dlog4j.configuration=file:/usr/local/spark/conf/log4j.properties when running the spark-sql command, for example:
    /usr/local/spark/bin/spark-sql \
      --driver-class-path '/opt/boostkit/*' \
      --jars '/opt/boostkit/*' \
      --conf 'spark.executor.extraClassPath=./*' \
      --name tpch_query6.sql \
      --driver-memory 50G \
      --driver-java-options -Dlog4j.configuration=file:/usr/local/spark/conf/log4j.properties \
      --executor-memory 32G \
      --num-executors 30 \
      --executor-cores 18
  4. Run the tpch-sql6 statement.
    select
        sum(l_extendedprice * l_discount) as revenue
    from
        tpch_flat_orc_1000.lineitem
    where
        l_shipdate >= '1993-01-01'
        and l_shipdate < '1994-01-01'
        and l_discount between 0.06 - 0.01 and 0.06 + 0.01
        and l_quantity < 25;

    When the task is executed, the pushdown information is printed as follows:

    Selectivity: 0.014160436451808448
    Push down with [PushDownInfo(ListBuffer(FilterExeInfo((((((((isnotnull(l_shipdate#18) AND isnotnull(l_discount#14)) AND isnotnull(l_quantity#12)) AND (l_shipdate#18 >= 8401)) AND (l_shipdate#18 < 8766)) AND (l_discount#14 >= 0.05)) AND (l_discount#14 <= 0.07)) AND (l_quantity#12 < 25.0)),List(l_quantity#12, l_extendedprice#13, l_discount#14, l_shipdate#18))),ListBuffer(AggExeInfo(List(sum((l_extendedprice#13 * l_discount#14))),List(),List(sum#43))),None,Map(server1 -> server1, agent2 -> agent2, agent1 -> agent1))]

    The output contains the pushdown selectivity and operator information.
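In the pushed-down filter, the l_shipdate predicates appear as integer literals (8401 and 8766) rather than date strings because Spark represents DATE values internally as the number of days since the Unix epoch (1970-01-01). A quick check of this encoding:

```python
from datetime import date

# Spark encodes DATE columns as days since the Unix epoch (1970-01-01),
# which is why the pushed-down predicate shows integer literals
# instead of the '1993-01-01' and '1994-01-01' strings from the query.
epoch = date(1970, 1, 1)
lower = (date(1993, 1, 1) - epoch).days
upper = (date(1994, 1, 1) - epoch).days
print(lower, upper)  # 8401 8766
```

These match the (l_shipdate#18 >= 8401) and (l_shipdate#18 < 8766) conditions in the pushdown log above.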

    Table 2 lists the parameters in the spark-sql command.

    Table 2 OmniData parameters

    | Parameter              | Recommended Value                                                  | Description                                                             |
    | --driver-memory        | 50G                                                                | Available memory of the driver                                          |
    | --executor-memory      | 32G                                                                | Available memory of each executor                                       |
    | --num-executors        | 30                                                                 | Number of executors to start. The default value is 2.                   |
    | --executor-cores       | 18                                                                 | Number of CPU cores used by each executor. The default value is 1.      |
    | --driver-class-path    | '/opt/boostkit/*'                                                  | Path of the extra JAR packages to be passed to the driver               |
    | --jars                 | '/opt/boostkit/*'                                                  | JAR packages to be included in the driver and executor class paths      |
    | --conf                 | 'spark.executor.extraClassPath=./*'                                | Sets a Spark configuration property                                     |
    | --driver-java-options  | -Dlog4j.configuration=file:/usr/local/spark/conf/log4j.properties  | Path of the Log4j log configuration file                                |