
Deploying the Spark Engine

Planning the Cluster Environment

The environment planned in this section consists of seven servers: one task submission node, three compute nodes, and three storage nodes. In the big data cluster, the Spark driver functions as the task submission node, the compute nodes are agent1, agent2, and agent3, and the storage nodes are ceph1, ceph2, and ceph3. See Figure 1.

Figure 1 Environment configuration

Table 1 lists the hardware environment of the cluster.

Table 1 Hardware configurations

  Item                   Specification
  ---------------------  --------------------------------------------------------------
  Processor              Kunpeng 920 5220
  Memory capacity        384 GB (12 x 32 GB)
  Memory frequency       2666 MHz
  NIC                    25GE for the service network; GE for the management network
  Drive                  System drive: 1 x RAID 0 (1 x 1.2 TB SAS HDD)
                         Management node: 12 x RAID 0 (1 x 4 TB SATA HDD)
                         Service node: 12 x RAID 0 (1 x 4 TB SATA HDD), 1 x 3.2 TB NVMe
  RAID controller card   LSI SAS3508

Table 2 lists the required software versions.

Table 2 Software configurations

  Item        Version
  ----------  ------------------------
  OS          openEuler 20.03 LTS SP1
  JDK         BiSheng JDK 8u262
  Hadoop      3.2.0
  Spark       3.0.0
  Hive        3.1.0
  ZooKeeper   3.6.2
  Ceph        14.2.8
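Before installing, it is worth confirming that the versions deployed on the cluster are at least those in Table 2. The sketch below is one way to compare version strings with sort -V; the helper name is illustrative, and the hard-coded "detected" versions are placeholders for real probes such as `hadoop version`.

```shell
#!/bin/sh
# Return success when the detected version is at least the required one.
# sort -V orders version strings naturally (e.g. 3.2.0 sorts before 3.10.0).
version_at_least() {
  detected=$1; required=$2
  [ "$(printf '%s\n%s\n' "$required" "$detected" | sort -V | head -n1)" = "$required" ]
}

# Check the components from Table 2. Replace the placeholder "detected"
# value with a real probe, e.g. detected=$(hadoop version | awk 'NR==1{print $2}').
for pair in "Hadoop:3.2.0" "Spark:3.0.0" "Hive:3.1.0" "ZooKeeper:3.6.2" "Ceph:14.2.8"; do
  name=${pair%%:*}
  required=${pair#*:}
  detected=$required  # placeholder: substitute the probed version here
  if version_at_least "$detected" "$required"; then
    echo "$name $detected OK (requires $required)"
  else
    echo "$name $detected is older than required $required"
  fi
done
```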

Installing the Spark Engine

During the installation, use /usr/local/spark-plugin-jar as the software installation directory and place all of the JAR packages on which the Spark engine depends in this directory, as listed in Table 3.

Table 3 Installation directory

  Installation node: Server (server1)
  Installation directory: /usr/local/spark-plugin-jar

  Component                                           How to Obtain
  --------------------------------------------------  -----------------------------------------
  slice-0.38.jar                                      Kunpeng Community
  boostkit-omnidata-server-1.1.0-aarch64.jar          Huawei Support website
  boostkit-omnidata-spark-sql_2.12-3.0.0-1.1.0.jar    Kunpeng Community, or compile from source
  commons-lang3-3.10.jar                              Kunpeng Community
  curator-client-2.12.0.jar                           Kunpeng Community
  curator-framework-2.12.0.jar                        Kunpeng Community
  curator-recipes-2.12.0.jar                          Kunpeng Community
  guava-26.0-jre.jar                                  Kunpeng Community
  hadoop-aws-3.2.0.jar                                Kunpeng Community
  hetu-transport-1.4.1.jar                            Kunpeng Community
  hdfs-ceph-3.2.0.jar                                 Kunpeng Community
  jackson-datatype-guava-2.12.4.jar                   Kunpeng Community
  jackson-datatype-jdk8-2.12.4.jar                    Kunpeng Community
  jackson-datatype-joda-2.12.4.jar                    Kunpeng Community
  jackson-datatype-jsr310-2.12.4.jar                  Kunpeng Community
  jackson-module-parameter-names-2.12.4.jar           Kunpeng Community
  jasypt-1.9.3.jar                                    Kunpeng Community
  haf-jni-call-1.0.jar                                Huawei Support website
  jol-core-0.2.jar                                    Kunpeng Community
  joni-2.1.5.3.jar                                    Kunpeng Community
  log-0.193.jar                                       Kunpeng Community
  perfmark-api-0.23.0.jar                             Kunpeng Community
  presto-main-1.4.1.jar                               Kunpeng Community
  presto-spi-1.4.1.jar                                Kunpeng Community
  protobuf-java-3.12.0.jar                            Kunpeng Community
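Once every package in Table 3 has been downloaded and copied, a quick presence check can save debugging time later. A minimal sketch, assuming the installation directory from Table 3; the function name is illustrative:

```shell
#!/bin/sh
# Verify that every JAR package from Table 3 is present in the given
# directory; print each missing file and a summary count.
check_plugin_jars() {
  dir=$1
  missing=0
  for jar in \
      slice-0.38.jar \
      boostkit-omnidata-server-1.1.0-aarch64.jar \
      boostkit-omnidata-spark-sql_2.12-3.0.0-1.1.0.jar \
      commons-lang3-3.10.jar \
      curator-client-2.12.0.jar \
      curator-framework-2.12.0.jar \
      curator-recipes-2.12.0.jar \
      guava-26.0-jre.jar \
      hadoop-aws-3.2.0.jar \
      hetu-transport-1.4.1.jar \
      hdfs-ceph-3.2.0.jar \
      jackson-datatype-guava-2.12.4.jar \
      jackson-datatype-jdk8-2.12.4.jar \
      jackson-datatype-joda-2.12.4.jar \
      jackson-datatype-jsr310-2.12.4.jar \
      jackson-module-parameter-names-2.12.4.jar \
      jasypt-1.9.3.jar \
      haf-jni-call-1.0.jar \
      jol-core-0.2.jar \
      joni-2.1.5.3.jar \
      log-0.193.jar \
      perfmark-api-0.23.0.jar \
      presto-main-1.4.1.jar \
      presto-spi-1.4.1.jar \
      protobuf-java-3.12.0.jar; do
    if [ ! -f "$dir/$jar" ]; then
      echo "missing: $jar"
      missing=$((missing + 1))
    fi
  done
  echo "$missing JAR(s) missing from $dir"
}

# Usage on the task submission node:
# check_plugin_jars /usr/local/spark-plugin-jar
```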

  1. Create the /usr/local/spark-plugin-jar directory.

     mkdir -p /usr/local/spark-plugin-jar

  2. On the task submission node (Spark driver), copy the server JAR package (boostkit-omnidata-spark-sql_2.12-3.0.0-1.1.0.zip/boostkit-omnidata-openlookeng-1.4.1-1.1.0-aarch64/boostkit-omnidata-server-1.1.0-aarch64.jar) prepared in Obtaining Software to the /usr/local/spark-plugin-jar directory.

     cp boostkit-omnidata-server-1.1.0-aarch64.jar /usr/local/spark-plugin-jar

  3. Copy haf-jni-call-1.0.jar (in the BoostKit-haf_1.0.zip\haf-1.0.tar.gz\haf-host-1.0.tar.gz\lib\jar directory) obtained in Obtaining Software to the /usr/local/spark-plugin-jar directory.

     cp haf-jni-call-1.0.jar /usr/local/spark-plugin-jar

  4. Copy hdfs-ceph-3.2.0.jar obtained in Obtaining Software to the /usr/local/spark-plugin-jar directory.

     cp hdfs-ceph-3.2.0.jar /usr/local/spark-plugin-jar

  5. Use an FTP tool to upload the boostkit-omnidata-spark-sql_2.12-3.0.0-1.1.0-aarch64.zip package to the installation environment, then decompress it.

     unzip boostkit-omnidata-spark-sql_2.12-3.0.0-1.1.0-aarch64.zip

  6. Copy the JAR packages extracted from boostkit-omnidata-spark-sql_2.12-3.0.0-1.1.0-aarch64.zip to the /usr/local/spark-plugin-jar directory.

     cd boostkit-omnidata-spark-sql_2.12-3.0.0-1.1.0-aarch64
     cp *.jar /usr/local/spark-plugin-jar

     If you need to manually compile boostkit-omnidata-spark-sql_2.12-3.0.0-1.1.0.jar, follow the instructions in README.md.

  7. Add the operator pushdown parameters to the Spark configuration file ($SPARK_HOME/conf/spark-defaults.conf). In this environment, $SPARK_HOME is /usr/local/spark.

     1. Open the Spark configuration file.

        vim /usr/local/spark/conf/spark-defaults.conf

     2. Add the following parameters to spark-defaults.conf:

        spark.sql.cbo.enabled   true
        spark.sql.cbo.planStats.enabled true
        spark.sql.ndp.enabled   true
        spark.sql.ndp.filter.selectivity.enable true
        spark.sql.ndp.filter.selectivity    0.5
        spark.sql.ndp.alive.omnidata 3
        spark.sql.ndp.table.size.threshold  10
        spark.sql.ndp.zookeeper.address agent1:2181,agent2:2181,agent3:2181
        spark.sql.ndp.zookeeper.path    /sdi/status
        spark.sql.ndp.zookeeper.timeout 15000
        spark.driver.extraLibraryPath   /opt/haf-install/haf-host/lib
        spark.executor.extraLibraryPath  /opt/haf-install/haf-host/lib
        spark.executorEnv.HAF_CONFIG_PATH /opt/haf-install/haf-host/

        Alternatively, run the set command in spark-sql to set the preceding parameters.

      Table 4 lists the operator pushdown parameters to be added.

      Table 4 Operator pushdown parameters

      spark.sql.cbo.enabled
          Recommended value: true
          Whether to enable cost-based optimization (CBO). When true, CBO uses collected statistics to estimate the execution plan.

      spark.sql.cbo.planStats.enabled
          Recommended value: true
          When true, the logical plan obtains row and column statistics from the catalog.

      spark.sql.ndp.enabled
          Recommended value: true
          Whether to enable operator pushdown.

      spark.sql.ndp.filter.selectivity.enable
          Recommended value: true
          Whether to use filter selectivity to decide whether operators are pushed down.

      spark.sql.ndp.filter.selectivity
          Recommended value: 0.5
          Operators are pushed down when the actual filter selectivity is less than this value; a smaller value means less data remains after filtering. The type is double and the default value is 0.5. This parameter takes effect only when spark.sql.ndp.filter.selectivity.enable is true. To force pushdown, set it to 1.0.

      spark.sql.ndp.table.size.threshold
          Recommended value: 10240
          Table size threshold, in bytes, for operator pushdown. The default value is 10240. Operators are pushed down when the actual table size exceeds this value.

      spark.sql.ndp.alive.omnidata
          Recommended value: 3
          Number of OmniData servers in the cluster.

      spark.sql.ndp.zookeeper.address
          Recommended value: agent1:2181,agent2:2181,agent3:2181
          Addresses for connecting to ZooKeeper.

      spark.sql.ndp.zookeeper.path
          Recommended value: /sdi/status
          ZooKeeper directory that stores pushed-down resource information.

      spark.sql.ndp.zookeeper.timeout
          Recommended value: 15000
          ZooKeeper timeout interval, in ms.

      spark.driver.extraLibraryPath
          Recommended value: /opt/haf-install/haf-host/lib
          Path of the library files that the Spark driver depends on.

      spark.executor.extraLibraryPath
          Recommended value: /opt/haf-install/haf-host/lib
          Path of the library files that the Spark executor depends on.

      spark.executorEnv.HAF_CONFIG_PATH
          Recommended value: /opt/haf-install/haf-host/
          Path of the configuration file for enabling the HAF.
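The edits in step 7 can also be scripted. A minimal sketch that appends the parameters idempotently, using the same keys and values as the listing in step 7; the function names are illustrative:

```shell
#!/bin/sh
# Append "key value" to a spark-defaults.conf file unless the key is
# already configured, so the script can be re-run safely.
add_param() {
  conf=$1; key=$2; value=$3
  if ! grep -q "^$key[[:space:]]" "$conf" 2>/dev/null; then
    printf '%s %s\n' "$key" "$value" >> "$conf"
  fi
}

# Write all operator pushdown parameters from step 7 into one file.
configure_pushdown() {
  conf=$1
  add_param "$conf" spark.sql.cbo.enabled true
  add_param "$conf" spark.sql.cbo.planStats.enabled true
  add_param "$conf" spark.sql.ndp.enabled true
  add_param "$conf" spark.sql.ndp.filter.selectivity.enable true
  add_param "$conf" spark.sql.ndp.filter.selectivity 0.5
  add_param "$conf" spark.sql.ndp.alive.omnidata 3
  add_param "$conf" spark.sql.ndp.table.size.threshold 10
  add_param "$conf" spark.sql.ndp.zookeeper.address agent1:2181,agent2:2181,agent3:2181
  add_param "$conf" spark.sql.ndp.zookeeper.path /sdi/status
  add_param "$conf" spark.sql.ndp.zookeeper.timeout 15000
  add_param "$conf" spark.driver.extraLibraryPath /opt/haf-install/haf-host/lib
  add_param "$conf" spark.executor.extraLibraryPath /opt/haf-install/haf-host/lib
  add_param "$conf" spark.executorEnv.HAF_CONFIG_PATH /opt/haf-install/haf-host/
}

# Usage on the task submission node:
# configure_pushdown /usr/local/spark/conf/spark-defaults.conf
```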

HAF log directory on a host node: /var/log/haf-host/haf-user