Installing Spark

The OmniData feature supports the Spark engine. Before using OmniData, install Spark and add the OmniData parameters to the Spark configuration file.

Planning the Cluster Environment

The environment planned in this section consists of seven servers: one task submission node, three compute nodes, and three storage nodes. In the big data cluster, the Spark driver functions as the task submission node, the compute nodes are agent1, agent2, and agent3, and the storage nodes are ceph1, ceph2, and ceph3. See Figure 1.

Figure 1 Configuring the Environment

Table 1 describes the cluster hardware environment.

**Table 1** Hardware configuration
Item	Configuration
Processor	Kunpeng 920
Memory size	384 GB (12 x 32 GB)
Memory frequency	2666 MHz
Network	Ceph environment: 25GE for the service network and GE for the management network; HDFS environment: 10GE for the service network and GE for the management network
Drive	System drive: 1 x RAID 0 (1 x 1.2 TB SAS HDD); Management node: 12 x RAID 0 (1 x 4 TB SATA HDD); Compute node, Ceph environment: 1 x 3.2 TB NVMe; Compute node, HDFS environment: 12 x RAID 0 (1 x 4 TB SATA HDD); Storage node, Ceph environment: 12 x RAID 0 (1 x 4 TB SATA HDD) and 1 x 3.2 TB NVMe; Storage node, HDFS environment: 12 x RAID 0 (1 x 4 TB SATA HDD)
RAID controller card	LSI SAS3508

Installing Spark 3.1.1

During the installation, select the /opt/boostkit directory as the software installation directory and place all JAR packages on which Spark compilation depends in this directory.

The aws-java-sdk-bundle-1.11.375.jar and hdfs-ceph-3.2.0.jar packages need to be added in the Ceph environment. The HDFS environment does not require these packages. The other packages listed in the previous table can be obtained by running the spark_build.sh script on Gitee.

Create the /opt/boostkit directory.
mkdir -p /opt/boostkit
On the task submission node (Spark driver), upload the boostkit-omnidata-client-1.5.0-aarch64.jar and boostkit-omnidata-common-1.5.0-aarch64.jar files (contained in BoostKit-omnidata_1.5.0.zip\BoostKit-omnidata_1.5.0.tar.gz) obtained in Obtaining Software to the /opt/boostkit directory.
cp boostkit-omnidata-client-1.5.0-aarch64.jar /opt/boostkit
cp boostkit-omnidata-common-1.5.0-aarch64.jar /opt/boostkit
Upload the haf-1.4.0.jar file (contained in BoostKit-haf_1.4.0.zip\haf-1.4.0.tar.gz\haf-host-1.4.0.tar.gz\lib\jar) obtained in Obtaining Software to the /opt/boostkit directory.
cp haf-1.4.0.jar /opt/boostkit
Upload hdfs-ceph-3.2.0.jar obtained in Obtaining Software and aws-java-sdk-bundle-1.11.375.jar in boostkit-omnidata-server-1.5.0-aarch64-lib.zip to the /opt/boostkit directory. (If the HDFS storage system is used, skip this step.)
cp hdfs-ceph-3.2.0.jar /opt/boostkit
cp aws-java-sdk-bundle-1.11.375.jar /opt/boostkit
Use an FTP tool to upload the boostkit-omnidata-spark-sql_2.12-3.1.1-1.5.0-aarch64.zip package obtained in Obtaining Software to the installation environment and decompress the package.
unzip boostkit-omnidata-spark-sql_2.12-3.1.1-1.5.0-aarch64.zip
Copy the JAR packages in the boostkit-omnidata-spark-sql_2.12-3.1.1-1.5.0-aarch64.zip package to the /opt/boostkit directory.
cd boostkit-omnidata-spark-sql_2.12-3.1.1-1.5.0-aarch64
cp *.jar /opt/boostkit
If you need to manually compile boostkit-omnidata-spark-sql_2.12-3.1.1-1.5.0.jar, follow the instructions in README.md.
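After the copy steps, it can help to confirm that the JARs named above actually landed in the installation directory. The following is a minimal sketch, not part of the product: the function name check_boostkit_jars and the storage labels "ceph" and "hdfs" are assumptions made for illustration, and the check covers only the files named explicitly in the steps above (not the JARs extracted from the spark-sql zip package).

```shell
# Sketch: list the JARs named in the installation steps that are missing
# from the given directory. The second argument is "ceph" or "hdfs"
# (hypothetical labels); the two Ceph-only JARs are checked only for "ceph".
check_boostkit_jars() {
    dir="$1"
    storage="$2"
    jars="boostkit-omnidata-client-1.5.0-aarch64.jar
boostkit-omnidata-common-1.5.0-aarch64.jar
haf-1.4.0.jar"
    if [ "$storage" = "ceph" ]; then
        jars="$jars
hdfs-ceph-3.2.0.jar
aws-java-sdk-bundle-1.11.375.jar"
    fi
    missing=0
    for jar in $jars; do
        if [ ! -f "$dir/$jar" ]; then
            echo "missing: $jar"
            missing=$((missing + 1))
        fi
    done
    # Succeed only when nothing is missing.
    [ "$missing" -eq 0 ]
}
```

For the Ceph environment described in this section, the function could be called as check_boostkit_jars /opt/boostkit ceph. It prints one line per missing file and returns a non-zero status if any file is absent.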

Add new OmniData parameters to the Spark configuration file ($SPARK_HOME/conf/spark-defaults.conf).

In this example, $SPARK_HOME is /usr/local/spark.

Open the Spark configuration file.

vi /usr/local/spark/conf/spark-defaults.conf

Press i to enter insert mode and add the following parameters to spark-defaults.conf:

spark.sql.cbo.enabled   true
spark.sql.cbo.planStats.enabled true
spark.sql.ndp.enabled   true
spark.sql.ndp.filter.selectivity.enable true
spark.sql.ndp.filter.selectivity    0.5
spark.sql.ndp.alive.omnidata 3
spark.sql.ndp.table.size.threshold  10240
spark.sql.ndp.zookeeper.address agent1:2181,agent2:2181,agent3:2181
spark.sql.ndp.zookeeper.path    /sdi/status
spark.sql.ndp.zookeeper.timeout 15000
spark.driver.extraLibraryPath   /home/omm/omnidata-install/haf-host/lib
spark.executor.extraLibraryPath  /home/omm/omnidata-install/haf-host/lib
spark.executorEnv.HAF_CONFIG_PATH /home/omm/omnidata-install/haf-host/etc/

You can also run the set command to set the preceding parameters in spark-sql.
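As an illustration of the set approach, the sketch below turns the spark.sql.* lines of a spark-defaults.conf-style file into SET statements that can be pasted into a spark-sql session. The function name conf_to_set_statements is hypothetical. It assumes each line holds a key and a value separated by whitespace, and it deliberately skips keys outside spark.sql.* (such as the extraLibraryPath entries), on the assumption that those are read at launch time rather than per session.

```shell
# Sketch: print "SET key=value;" for every spark.sql.* entry in a
# spark-defaults.conf-style file (key and value separated by whitespace).
# Comment lines and non-spark.sql keys are ignored.
conf_to_set_statements() {
    awk '$1 ~ /^spark\.sql\./ && NF >= 2 { print "SET " $1 "=" $2 ";" }' "$1"
}
```

For example, conf_to_set_statements /usr/local/spark/conf/spark-defaults.conf would emit lines such as SET spark.sql.ndp.enabled=true; for the configuration shown above.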

Table 2 lists the OmniData parameters to be added.

**Table 2** OmniData parameters
Parameter	Recommended Value	Description
spark.sql.cbo.enabled	true	Indicates whether to enable cost-based optimization (CBO). If this parameter is set to true, CBO is used to estimate plan statistics.
spark.sql.cbo.planStats.enabled	true	If this parameter is set to true, the logical plan obtains row counts and column statistics from the catalog.
spark.sql.ndp.enabled	true	Indicates whether to enable OmniData.
spark.sql.ndp.filter.selectivity.enable	true	Indicates whether to enable filter selectivity to determine whether to perform operator pushdown.
spark.sql.ndp.filter.selectivity	0.5	Filter selectivity threshold (type: double; default: 0.5). If the actual filter selectivity is less than this value, operator pushdown is performed. A smaller value indicates a smaller amount of data to be filtered. This parameter takes effect only when spark.sql.ndp.filter.selectivity.enable is set to true. To force pushdown, set this parameter to 1.0.
spark.sql.ndp.table.size.threshold	10240	Table size threshold for operator pushdown. The default value is 10240, in bytes. When the actual table size is larger than the value, operator pushdown is performed.
spark.sql.ndp.alive.omnidata	3	Number of OmniData servers in the cluster.
spark.sql.ndp.zookeeper.address	agent1:2181,agent2:2181,agent3:2181	Comma-separated list of ZooKeeper server addresses, each in host:port format.
spark.sql.ndp.zookeeper.path	/sdi/status	ZooKeeper directory for storing pushed-down resource information.
spark.sql.ndp.zookeeper.timeout	15000	ZooKeeper timeout duration, in milliseconds.
spark.driver.extraLibraryPath	/home/omm/omnidata-install/haf-host/lib	Path to library files on which the Spark driver depends.
spark.executor.extraLibraryPath	/home/omm/omnidata-install/haf-host/lib	Path to library files on which the Spark executor depends.
spark.executorEnv.HAF_CONFIG_PATH	/home/omm/omnidata-install/haf-host/etc/	Path to the configuration file for enabling HAF.
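To make the two thresholds in Table 2 concrete, the following sketch evaluates them for a given table size and filter selectivity. It is only an illustration of the comparisons described in the table, not OmniData's actual decision logic: the function name should_push_down is hypothetical, and requiring both conditions to hold at once is an assumption, since the table describes each threshold separately.

```shell
# Sketch of the two threshold comparisons described in Table 2.
# ASSUMPTION: both conditions must hold together; the source does not
# state how OmniData combines them.
# Arguments: table size in bytes, actual filter selectivity,
# optional size threshold (default 10240), optional selectivity
# threshold (default 0.5).
should_push_down() {
    table_size_bytes="$1"
    filter_selectivity="$2"
    size_threshold="${3:-10240}"
    selectivity_threshold="${4:-0.5}"
    # awk handles the floating-point comparison; exit status 0 means "push down".
    awk -v s="$table_size_bytes" -v f="$filter_selectivity" \
        -v st="$size_threshold" -v ft="$selectivity_threshold" \
        'BEGIN { exit !((s + 0 > st + 0) && (f + 0 < ft + 0)) }'
}
```

With the recommended values, should_push_down 20480 0.3 succeeds (the table is larger than 10240 bytes and the selectivity is below 0.5), while should_push_down 1024 0.3 fails because the table is below the size threshold.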

Press Esc, type :wq!, and press Enter to save the file and exit.
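Once the file is saved, a quick check can confirm that every parameter from Table 2 is present. This is a minimal sketch: the function name missing_omnidata_keys is hypothetical, and it only tests that each key appears as the first field of some line, without validating the values.

```shell
# Sketch: print each Table 2 parameter that does not appear as the
# first field of any line in the given configuration file.
missing_omnidata_keys() {
    conf="$1"
    for key in \
        spark.sql.cbo.enabled \
        spark.sql.cbo.planStats.enabled \
        spark.sql.ndp.enabled \
        spark.sql.ndp.filter.selectivity.enable \
        spark.sql.ndp.filter.selectivity \
        spark.sql.ndp.table.size.threshold \
        spark.sql.ndp.alive.omnidata \
        spark.sql.ndp.zookeeper.address \
        spark.sql.ndp.zookeeper.path \
        spark.sql.ndp.zookeeper.timeout \
        spark.driver.extraLibraryPath \
        spark.executor.extraLibraryPath \
        spark.executorEnv.HAF_CONFIG_PATH
    do
        # Exact match on the first field, so a key that is a prefix of
        # another key is not counted by mistake.
        if ! awk -v k="$key" '$1 == k { found = 1 } END { exit !found }' "$conf"; then
            echo "$key"
        fi
    done
}
```

Running missing_omnidata_keys /usr/local/spark/conf/spark-defaults.conf prints nothing when all parameters are present, and one key per line otherwise.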

HAF log directory on a compute node: /home/omm/omnidata-install/haf-host/logs
If the cluster is in secure mode, set the spark.sql.ndp.zookeeper.jaas.conf and spark.sql.ndp.zookeeper.krb5.conf parameters to the file paths of JAAS and krb5, respectively. The file paths are user-defined.
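For a cluster in secure mode, the two parameters can be appended to the configuration file in the same key-value form as the other entries. The sketch below assumes this form; the function name add_secure_zookeeper_conf is hypothetical, and the file paths are whatever you have defined for your JAAS and krb5 files.

```shell
# Sketch: append the two secure-mode parameters to a Spark configuration
# file. The JAAS and krb5 paths are user-defined and passed as arguments.
# Note: this appends unconditionally, so run it only once per file.
add_secure_zookeeper_conf() {
    conf="$1"
    jaas_path="$2"
    krb5_path="$3"
    printf '%s %s\n' spark.sql.ndp.zookeeper.jaas.conf "$jaas_path" >> "$conf"
    printf '%s %s\n' spark.sql.ndp.zookeeper.krb5.conf "$krb5_path" >> "$conf"
}
```

A call might look like add_secure_zookeeper_conf /usr/local/spark/conf/spark-defaults.conf /path/to/jaas.conf /path/to/krb5.conf, where both /path/to/ locations are placeholders.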
