Deploying the Spark Engine
Planning the Cluster Environment
The environment planned in this section consists of seven servers, including one task submission node, three compute nodes, and three storage nodes. In the big data cluster, the Spark driver functions as the task submission node, and the compute nodes are agent1, agent2, and agent3. The storage nodes are ceph1, ceph2, and ceph3. See Figure 1.
Table 1 lists the hardware environment of the cluster.
| Item | Model |
|---|---|
| Processor | Kunpeng 920 5220 |
| Memory size | 384 GB (12 x 32 GB) |
| Memory frequency | 2666 MHz |
| NIC | |
| Drive | |
| RAID controller card | LSI SAS3508 |
Installing Spark 3.0.0
During the installation, select the /usr/local/spark-plugin-jar directory as the software installation directory and place all JAR packages on which Spark compilation depends in this directory, as shown in Table 2.
| Installation Node | Installation Directory | Component | How to Obtain |
|---|---|---|---|
| Server (server1) | /usr/local/spark-plugin-jar | aws-java-sdk-bundle-1.11.375.jar | Download it from the Kunpeng Community. |
| | | bcpkix-jdk15on-1.68.jar | Download it from the Kunpeng Community. |
| | | boostkit-omnidata-client-1.4.0-aarch64.jar | Download it from the Huawei Support website. |
| | | boostkit-omnidata-common-1.4.0-aarch64.jar | Download it from the Huawei Support website. |
| | | boostkit-omnidata-spark-sql_2.12-3.0.0-1.4.0.jar | Download it from the Kunpeng Community or use the source code for compilation. |
| | | curator-client-2.12.0.jar | Download it from the Kunpeng Community. |
| | | curator-framework-2.12.0.jar | Download it from the Kunpeng Community. |
| | | curator-recipes-2.12.0.jar | Download it from the Kunpeng Community. |
| | | fastjson-1.2.83.jar | Download it from the Kunpeng Community. |
| | | fst-2.57.jar | Download it from the Kunpeng Community. |
| | | guava-26.0-jre.jar | Download it from the Kunpeng Community. |
| | | haf-1.3.0.jar | Download it from the Huawei Support website. |
| | | hdfs-ceph-3.2.0.jar | Download it from the Kunpeng Community. |
| | | hetu-transport-1.6.1.jar | Download it from the Kunpeng Community. |
| | | jackson-datatype-guava-2.12.4.jar | Download it from the Kunpeng Community. |
| | | jackson-datatype-jdk8-2.12.4.jar | Download it from the Kunpeng Community. |
| | | jackson-datatype-joda-2.13.3.jar | Download it from the Kunpeng Community. |
| | | jackson-datatype-jsr310-2.12.4.jar | Download it from the Kunpeng Community. |
| | | jackson-module-parameter-names-2.12.4.jar | Download it from the Kunpeng Community. |
| | | jasypt-1.9.3.jar | Download it from the Kunpeng Community. |
| | | jol-core-0.2.jar | Download it from the Kunpeng Community. |
| | | joni-2.1.5.3.jar | Download it from the Kunpeng Community. |
| | | log-0.193.jar | Download it from the Kunpeng Community. |
| | | perfmark-api-0.23.0.jar | Download it from the Kunpeng Community. |
| | | presto-main-1.6.1.jar | Download it from the Kunpeng Community. |
| | | presto-spi-1.6.1.jar | Download it from the Kunpeng Community. |
| | | protobuf-java-3.12.0.jar | Download it from the Kunpeng Community. |
| | | slice-0.38.jar | Download it from the Kunpeng Community. |
- Create the /usr/local/spark-plugin-jar directory.

```shell
mkdir -p /usr/local/spark-plugin-jar
```
- On the task submission node (Spark driver), upload the boostkit-omnidata-client-1.4.0-aarch64.jar and boostkit-omnidata-common-1.4.0-aarch64.jar files (contained in BoostKit-omnidata_1.4.0.zip\BoostKit-omnidata_1.4.0.tar.gz) obtained from Obtaining Software to the /usr/local/spark-plugin-jar directory.
```shell
cp boostkit-omnidata-client-1.4.0-aarch64.jar /usr/local/spark-plugin-jar
cp boostkit-omnidata-common-1.4.0-aarch64.jar /usr/local/spark-plugin-jar
```
- Upload the haf-1.3.0.jar file (contained in BoostKit-haf_1.3.0.zip\haf-1.3.0.tar.gz\haf-host-1.3.0.tar.gz\lib\jar) obtained in Obtaining Software to the /usr/local/spark-plugin-jar directory.
```shell
cp haf-1.3.0.jar /usr/local/spark-plugin-jar
```
- Upload the hdfs-ceph-3.2.0.jar file obtained in Obtaining Software and the aws-java-sdk-bundle-1.11.375.jar file contained in boostkit-omnidata-server-1.4.0-aarch64-lib.zip to the /usr/local/spark-plugin-jar directory. (If an HDFS storage system is used, skip this step.)
```shell
cp hdfs-ceph-3.2.0.jar /usr/local/spark-plugin-jar
cp aws-java-sdk-bundle-1.11.375.jar /usr/local/spark-plugin-jar
```
- Use an FTP tool to upload the boostkit-omnidata-spark-sql_2.12-3.0.0-1.4.0-aarch64.zip package to the installation environment and decompress the package.
```shell
unzip boostkit-omnidata-spark-sql_2.12-3.0.0-1.4.0-aarch64.zip
```

- Copy the JAR packages in the boostkit-omnidata-spark-sql_2.12-3.0.0-1.4.0-aarch64.zip package to the /usr/local/spark-plugin-jar directory.
- Add new operator pushdown parameters to the Spark configuration file ($SPARK_HOME/conf/spark-defaults.conf).
Replace $SPARK_HOME with /usr/local/spark.
- Edit the Spark configuration file.
```shell
vim /usr/local/spark/conf/spark-defaults.conf
```

- Press i to enter insert mode and add the following parameters to spark-defaults.conf:
```
spark.sql.cbo.enabled true
spark.sql.cbo.planStats.enabled true
spark.sql.ndp.enabled true
spark.sql.ndp.filter.selectivity.enable true
spark.sql.ndp.filter.selectivity 0.5
spark.sql.ndp.alive.omnidata 3
spark.sql.ndp.table.size.threshold 10
spark.sql.ndp.zookeeper.address agent1:2181,agent2:2181,agent3:2181
spark.sql.ndp.zookeeper.path /sdi/status
spark.sql.ndp.zookeeper.timeout 15000
spark.driver.extraLibraryPath /home/omm/omnidata-install/haf-host/lib
spark.executor.extraLibraryPath /home/omm/omnidata-install/haf-host/lib
spark.executorEnv.HAF_CONFIG_PATH /home/omm/omnidata-install/haf-host/etc/
```
You can also run the set command to set the preceding parameters in spark-sql.
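For the spark-sql alternative, the statements look like the following minimal illustration (session-scoped, so the settings apply only to the current spark-sql session and two representative parameters are shown):

```sql
-- Inside a spark-sql session; values match the spark-defaults.conf listing above.
SET spark.sql.ndp.enabled=true;
SET spark.sql.ndp.filter.selectivity=0.5;
```

Because SET does not persist across sessions, spark-defaults.conf remains the right place for cluster-wide defaults.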
Table 3 lists the operator pushdown parameters to be added.
Table 3 Operator pushdown parameters

| Parameter | Recommended Value | Description |
|---|---|---|
| spark.sql.cbo.enabled | true | Whether to enable cost-based optimization (CBO). If set to true, CBO uses collected statistics to estimate the execution plan. |
| spark.sql.cbo.planStats.enabled | true | If set to true, the logical plan obtains row and column statistics from the catalog. |
| spark.sql.ndp.enabled | true | Whether to enable operator pushdown. |
| spark.sql.ndp.filter.selectivity.enable | true | Whether to use filter selectivity to decide whether operators are pushed down. |
| spark.sql.ndp.filter.selectivity | 0.5 | Operators are pushed down when the actual filter selectivity is less than this value; a smaller value means less data is retained after filtering. The type is double and the default value is 0.5. Takes effect only when spark.sql.ndp.filter.selectivity.enable is set to true. Set it to 1.0 to force pushdown. |
| spark.sql.ndp.table.size.threshold | 10240 | Table size threshold for operator pushdown, in bytes (default 10240). Operators are pushed down when the actual table size exceeds this value. |
| spark.sql.ndp.alive.omnidata | 3 | Number of OmniData servers in the cluster. |
| spark.sql.ndp.zookeeper.address | agent1:2181,agent2:2181,agent3:2181 | ZooKeeper connection addresses, as a comma-separated host:port list. |
| spark.sql.ndp.zookeeper.path | /sdi/status | ZooKeeper directory that stores pushdown resource information. |
| spark.sql.ndp.zookeeper.timeout | 15000 | ZooKeeper timeout interval, in ms. |
| spark.driver.extraLibraryPath | /home/omm/omnidata-install/haf-host/lib | Path of the library files the Spark driver depends on. |
| spark.executor.extraLibraryPath | /home/omm/omnidata-install/haf-host/lib | Path of the library files Spark executors depend on. |
| spark.executorEnv.HAF_CONFIG_PATH | /home/omm/omnidata-install/haf-host/etc/ | Path of the configuration file for enabling HAF. |
- Press Esc, type :wq!, and press Enter to save the file and exit.
Installing Spark 3.1.1
During the installation, select the /usr/local/spark-plugin-jar directory as the software installation directory and place all JAR packages on which Spark compilation depends in this directory, as shown in Table 4.
| Installation Node | Installation Directory | Component | How to Obtain |
|---|---|---|---|
| Server (server1) | /usr/local/spark-plugin-jar | aws-java-sdk-bundle-1.11.375.jar | Download it from the Kunpeng Community. |
| | | bcpkix-jdk15on-1.68.jar | Download it from the Kunpeng Community. |
| | | boostkit-omnidata-client-1.4.0-aarch64.jar | Download it from the Huawei Support website. |
| | | boostkit-omnidata-common-1.4.0-aarch64.jar | Download it from the Huawei Support website. |
| | | boostkit-omnidata-spark-sql_2.12-3.1.1-1.4.0.jar | Download it from the Kunpeng Community or use the source code for compilation. |
| | | curator-client-2.12.0.jar | Download it from the Kunpeng Community. |
| | | curator-framework-2.12.0.jar | Download it from the Kunpeng Community. |
| | | curator-recipes-2.12.0.jar | Download it from the Kunpeng Community. |
| | | fastjson-1.2.83.jar | Download it from the Kunpeng Community. |
| | | fst-2.57.jar | Download it from the Kunpeng Community. |
| | | guava-26.0-jre.jar | Download it from the Kunpeng Community. |
| | | haf-1.3.0.jar | Download it from the Huawei Support website. |
| | | hdfs-ceph-3.2.0.jar | Download it from the Kunpeng Community. |
| | | hetu-transport-1.6.1.jar | Download it from the Kunpeng Community. |
| | | jackson-datatype-guava-2.12.4.jar | Download it from the Kunpeng Community. |
| | | jackson-datatype-jdk8-2.12.4.jar | Download it from the Kunpeng Community. |
| | | jackson-datatype-joda-2.13.3.jar | Download it from the Kunpeng Community. |
| | | jackson-datatype-jsr310-2.12.4.jar | Download it from the Kunpeng Community. |
| | | jackson-module-parameter-names-2.12.4.jar | Download it from the Kunpeng Community. |
| | | jasypt-1.9.3.jar | Download it from the Kunpeng Community. |
| | | jol-core-0.2.jar | Download it from the Kunpeng Community. |
| | | joni-2.1.5.3.jar | Download it from the Kunpeng Community. |
| | | log-0.193.jar | Download it from the Kunpeng Community. |
| | | perfmark-api-0.23.0.jar | Download it from the Kunpeng Community. |
| | | presto-main-1.6.1.jar | Download it from the Kunpeng Community. |
| | | presto-spi-1.6.1.jar | Download it from the Kunpeng Community. |
| | | protobuf-java-3.12.0.jar | Download it from the Kunpeng Community. |
| | | slice-0.38.jar | Download it from the Kunpeng Community. |
- Create the /usr/local/spark-plugin-jar directory.

```shell
mkdir -p /usr/local/spark-plugin-jar
```
- On the task submission node (Spark driver), upload the boostkit-omnidata-client-1.4.0-aarch64.jar and boostkit-omnidata-common-1.4.0-aarch64.jar files (contained in BoostKit-omnidata_1.4.0.zip\BoostKit-omnidata_1.4.0.tar.gz) obtained from Obtaining Software to the /usr/local/spark-plugin-jar directory.
```shell
cp boostkit-omnidata-client-1.4.0-aarch64.jar /usr/local/spark-plugin-jar
cp boostkit-omnidata-common-1.4.0-aarch64.jar /usr/local/spark-plugin-jar
```
- Upload the haf-1.3.0.jar file (contained in BoostKit-haf_1.3.0.zip\haf-1.3.0.tar.gz\haf-host-1.3.0.tar.gz\lib\jar) obtained in Obtaining Software to the /usr/local/spark-plugin-jar directory.
```shell
cp haf-1.3.0.jar /usr/local/spark-plugin-jar
```
- Upload the hdfs-ceph-3.2.0.jar file obtained in Obtaining Software and the aws-java-sdk-bundle-1.11.375.jar file contained in boostkit-omnidata-server-1.4.0-aarch64-lib.zip to the /usr/local/spark-plugin-jar directory. (If an HDFS storage system is used, skip this step.)
```shell
cp aws-java-sdk-bundle-1.11.375.jar /usr/local/spark-plugin-jar
cp hdfs-ceph-3.2.0.jar /usr/local/spark-plugin-jar
```
- Use an FTP tool to upload the boostkit-omnidata-spark-sql_2.12-3.1.1-1.4.0-aarch64.zip package to the installation environment and decompress the package.
```shell
unzip boostkit-omnidata-spark-sql_2.12-3.1.1-1.4.0-aarch64.zip
```

- Copy the JAR packages in the boostkit-omnidata-spark-sql_2.12-3.1.1-1.4.0-aarch64.zip package to the /usr/local/spark-plugin-jar directory.
- Add new operator pushdown parameters to the Spark configuration file ($SPARK_HOME/conf/spark-defaults.conf).
Replace $SPARK_HOME with /usr/local/spark.
- Edit the Spark configuration file.
```shell
vim /usr/local/spark/conf/spark-defaults.conf
```

- Press i to enter insert mode and add the following parameters to spark-defaults.conf:
```
spark.sql.cbo.enabled true
spark.sql.cbo.planStats.enabled true
spark.sql.ndp.enabled true
spark.sql.ndp.filter.selectivity.enable true
spark.sql.ndp.filter.selectivity 0.5
spark.sql.ndp.alive.omnidata 3
spark.sql.ndp.table.size.threshold 10
spark.sql.ndp.zookeeper.address agent1:2181,agent2:2181,agent3:2181
spark.sql.ndp.zookeeper.path /sdi/status
spark.sql.ndp.zookeeper.timeout 15000
spark.driver.extraLibraryPath /home/omm/omnidata-install/haf-host/lib
spark.executor.extraLibraryPath /home/omm/omnidata-install/haf-host/lib
spark.executorEnv.HAF_CONFIG_PATH /home/omm/omnidata-install/haf-host/etc/
```
You can also run the set command to set the preceding parameters in spark-sql.
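If an interactive vim session is inconvenient (for example, when scripting the deployment), the same parameters can be appended with a heredoc. This is a minimal sketch: SPARK_CONF defaults to a temporary file purely for illustration; in a real deployment, point it at $SPARK_HOME/conf/spark-defaults.conf.

```shell
#!/bin/sh
# Append the operator pushdown parameters non-interactively.
# SPARK_CONF is a placeholder default; override it with the real
# /usr/local/spark/conf/spark-defaults.conf path on the Spark driver node.
SPARK_CONF="${SPARK_CONF:-$(mktemp)}"
cat >> "$SPARK_CONF" <<'EOF'
spark.sql.cbo.enabled true
spark.sql.cbo.planStats.enabled true
spark.sql.ndp.enabled true
spark.sql.ndp.filter.selectivity.enable true
spark.sql.ndp.filter.selectivity 0.5
spark.sql.ndp.alive.omnidata 3
spark.sql.ndp.table.size.threshold 10
spark.sql.ndp.zookeeper.address agent1:2181,agent2:2181,agent3:2181
spark.sql.ndp.zookeeper.path /sdi/status
spark.sql.ndp.zookeeper.timeout 15000
spark.driver.extraLibraryPath /home/omm/omnidata-install/haf-host/lib
spark.executor.extraLibraryPath /home/omm/omnidata-install/haf-host/lib
spark.executorEnv.HAF_CONFIG_PATH /home/omm/omnidata-install/haf-host/etc/
EOF
# Sanity check: count the spark.sql.ndp entries just written (8 in this listing).
grep -c '^spark\.sql\.ndp' "$SPARK_CONF"
```

The quoted heredoc delimiter ('EOF') prevents the shell from expanding anything inside the block, so the lines land in the file exactly as written.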
Table 5 lists the operator pushdown parameters to be added.
Table 5 Operator pushdown parameters

| Parameter | Recommended Value | Description |
|---|---|---|
| spark.sql.cbo.enabled | true | Whether to enable cost-based optimization (CBO). If set to true, CBO uses collected statistics to estimate the execution plan. |
| spark.sql.cbo.planStats.enabled | true | If set to true, the logical plan obtains row and column statistics from the catalog. |
| spark.sql.ndp.enabled | true | Whether to enable operator pushdown. |
| spark.sql.ndp.filter.selectivity.enable | true | Whether to use filter selectivity to decide whether operators are pushed down. |
| spark.sql.ndp.filter.selectivity | 0.5 | Operators are pushed down when the actual filter selectivity is less than this value; a smaller value means less data is retained after filtering. The type is double and the default value is 0.5. Takes effect only when spark.sql.ndp.filter.selectivity.enable is set to true. Set it to 1.0 to force pushdown. |
| spark.sql.ndp.table.size.threshold | 10240 | Table size threshold for operator pushdown, in bytes (default 10240). Operators are pushed down when the actual table size exceeds this value. |
| spark.sql.ndp.alive.omnidata | 3 | Number of OmniData servers in the cluster. |
| spark.sql.ndp.zookeeper.address | agent1:2181,agent2:2181,agent3:2181 | ZooKeeper connection addresses, as a comma-separated host:port list. |
| spark.sql.ndp.zookeeper.path | /sdi/status | ZooKeeper directory that stores pushdown resource information. |
| spark.sql.ndp.zookeeper.timeout | 15000 | ZooKeeper timeout interval, in ms. |
| spark.driver.extraLibraryPath | /home/omm/omnidata-install/haf-host/lib | Path of the library files the Spark driver depends on. |
| spark.executor.extraLibraryPath | /home/omm/omnidata-install/haf-host/lib | Path of the library files Spark executors depend on. |
| spark.executorEnv.HAF_CONFIG_PATH | /home/omm/omnidata-install/haf-host/etc/ | Path of the configuration file for enabling HAF. |
- Press Esc, type :wq!, and press Enter to save the file and exit.
- HAF log directory on a host node: /home/omm/omnidata-install/haf-host/logs
- If the cluster is in secure mode, set the spark.sql.ndp.zookeeper.jaas.conf and spark.sql.ndp.zookeeper.krb5.conf parameters to the file paths of JAAS and krb5, respectively. The file paths are user-defined.
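Assuming the two secure-mode parameters use the same key-value format as the other spark-defaults.conf entries, they would look like the following; the file paths shown are hypothetical placeholders, since the actual paths are user-defined:

```
spark.sql.ndp.zookeeper.jaas.conf /etc/spark/conf/jaas.conf
spark.sql.ndp.zookeeper.krb5.conf /etc/krb5.conf
```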
