Deployment Process
- This section uses version 2.2.0 of the machine learning algorithm library as an example.
- The test tool JAR packages and scripts used in this section are for reference only. You need to develop them based on your specific requirements.
- Install the algorithm packages only on the client. Do not install them on controller or compute nodes.
| Installation Directory | Components to Be Installed |
|---|---|
| /home/test/boostkit/lib | boostkit-ml-acc_2.11-2.2.0-spark2.3.2.jar<br>boostkit-ml-core_2.11-2.2.0-spark2.3.2.jar<br>boostkit-ml-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar<br>boostkit-xgboost4j-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar<br>boostkit-xgboost4j-spark2.3.2_2.11-2.2.0.jar<br>boostkit-xgboost4j_2.11-2.2.0.jar<br>libboostkit_xgboost_kernel.so<br>fastutil-8.3.1.jar (third-party open-source library) |
| /home/test/boostkit/ | JAR test package<br>Shell script for task submission |
Perform the following steps:
- From the client node, log in to the server as an authorized user of the big data component. Install the third-party open-source library fastutil-8.3.1.jar, on which the algorithms depend, in the corresponding directory, and set the permission of the JAR file to 640.
- Go to /home/test/boostkit/lib/.

  ```shell
  cd /home/test/boostkit/lib
  ```

- Download the fastutil-8.3.1.jar file.

  ```shell
  wget https://repo1.maven.org/maven2/it/unimi/dsi/fastutil/8.3.1/fastutil-8.3.1.jar
  ```

- Change the permission on the JAR file.

  ```shell
  chmod 640 fastutil-8.3.1.jar
  ```
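The permission requirement above can be verified before moving on. The following helper is illustrative only and not part of the BoostKit tooling:

```shell
#!/bin/bash
# check_mode FILE MODE: succeed and print a confirmation if FILE's octal
# permission bits equal MODE; otherwise report the mismatch and fail.
check_mode() {
  local file="$1" expected="$2" actual
  actual=$(stat -c '%a' "$file") || return 1
  if [ "$actual" = "$expected" ]; then
    echo "$file: mode $actual (ok)"
  else
    echo "$file: mode $actual, expected $expected" >&2
    return 1
  fi
}

# Intended use after the chmod step:
# check_mode /home/test/boostkit/lib/fastutil-8.3.1.jar 640
```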
- Go to /home/test/boostkit/lib/.
- Copy the library adaptation packages to the /home/test/boostkit/lib/ directory on the client and set the permission for the packages to 550.

  ```shell
  cp /opt/Spark-ml-algo-lib-v2.2.0-spark2.3.2/ml-core/target/boostkit-ml-core_2.11-2.2.0-spark2.3.2.jar /home/test/boostkit/lib
  cp /opt/Spark-ml-algo-lib-v2.2.0-spark2.3.2/ml-accelerator/target/boostkit-ml-acc_2.11-2.2.0-spark2.3.2.jar /home/test/boostkit/lib
  cp /opt/Spark-ml-algo-lib-v2.2.0-spark2.3.2/ml-xgboost/jvm-packages/boostkit-xgboost4j/target/boostkit-xgboost4j_2.11-2.2.0_aarch64.jar /home/test/boostkit/lib
  cp /opt/Spark-ml-algo-lib-v2.2.0-spark2.3.2/ml-xgboost/jvm-packages/boostkit-xgboost4j-spark/target/boostkit-xgboost4j-spark_2.11-2.2.0_aarch64.jar /home/test/boostkit/lib
  chmod 550 /home/test/boostkit/lib/boostkit-*
  chmod 550 /home/test/boostkit/lib/libboostkit_xgboost_kernel.so
  ```
- Save the JAR file of your own algorithm test tool (for example, ml-test.jar) to /home/test/boostkit/ on the client, the parent directory of the algorithm package files.
- If you are running an algorithm other than XGBoost, refer to the following shell script content for task submission (select either yarn-client or yarn-cluster mode).
- Save the shell script for task submission in the /home/test/boostkit/ directory where the test JAR file is stored, and start the Spark job in yarn-client mode. An example of the shell script content is as follows:
  ```shell
  #!/bin/bash
  spark-submit \
    --class com.bigdata.ml.RFMain \
    --master yarn \
    --deploy-mode client \
    --driver-cores 36 \
    --driver-memory 50g \
    --jars "lib/fastutil-8.3.1.jar,lib/boostkit-ml-acc_2.11-2.2.0-spark2.3.2.jar,lib/boostkit-ml-core_2.11-2.2.0-spark2.3.2.jar,lib/boostkit-ml-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar" \
    --conf "spark.executor.extraClassPath=fastutil-8.3.1.jar:boostkit-ml-acc_2.11-2.2.0-spark2.3.2.jar:boostkit-ml-core_2.11-2.2.0-spark2.3.2.jar:boostkit-ml-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar" \
    --driver-class-path "lib/ml-test.jar:lib/fastutil-8.3.1.jar:lib/snakeyaml-1.17.jar:lib/boostkit-ml-acc_2.11-2.2.0-spark2.3.2.jar:lib/boostkit-ml-core_2.11-2.2.0-spark2.3.2.jar:lib/boostkit-ml-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar" \
    ./ml-test.jar
  ```
- Save the shell script for task submission in the /home/test/boostkit/ directory where the test JAR file is stored, and start the Spark job in yarn-cluster mode. An example of the shell script content is as follows:
  ```shell
  #!/bin/bash
  spark-submit \
    --class com.bigdata.ml.RFMain \
    --master yarn \
    --deploy-mode cluster \
    --driver-cores 36 \
    --driver-memory 50g \
    --jars "lib/fastutil-8.3.1.jar,lib/boostkit-ml-acc_2.11-2.2.0-spark2.3.2.jar,lib/boostkit-ml-core_2.11-2.2.0-spark2.3.2.jar,lib/boostkit-ml-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar,lib/boostkit-xgboost4j-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar" \
    --driver-class-path "ml-test.jar:fastutil-8.3.1.jar:snakeyaml-1.17.jar:boostkit-ml-acc_2.11-2.2.0-spark2.3.2.jar:boostkit-ml-core_2.11-2.2.0-spark2.3.2.jar:boostkit-ml-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar" \
    --conf "spark.yarn.cluster.driver.extraClassPath=ml-test.jar:snakeyaml-1.17.jar:boostkit-ml-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar:boostkit-ml-acc_2.11-2.2.0-spark2.3.2.jar:boostkit-ml-core_2.11-2.2.0-spark2.3.2.jar" \
    --conf "spark.executor.extraClassPath=fastutil-8.3.1.jar:boostkit-ml-acc_2.11-2.2.0-spark2.3.2.jar:boostkit-ml-core_2.11-2.2.0-spark2.3.2.jar:boostkit-ml-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar" \
    ./ml-test.jar
  ```
- Because the XGBoost algorithm includes some C++ code, its submission parameters differ slightly from those of other algorithms. The preceding scripts apply only to algorithms other than XGBoost.
- By default, the logs generated during the running of algorithm packages are displayed on the client console and are not stored in files. You can import the customized log4j.properties file to save the logs to your local PC. For details, see Saving Run Logs to a Local PC.
- When submitting tasks in Spark single-node cluster mode, you are advised to enable identity authentication and disable the REST API to avoid Spark vulnerabilities.
- On different big data platforms, the paths specified by executorEnv.LD_LIBRARY_PATH, spark.executor.extraLibraryPath, and spark.driver.extraLibraryPath may be different from those in the example script. Set the parameters based on the actual scenario.
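The scripts above repeat the same JAR list across `--jars`, `--driver-class-path`, and the `extraClassPath` settings. One way to keep these lists from drifting apart is to build them once from a single array. This is a sketch under the same file layout as the examples; the `join` helper is not part of the BoostKit tooling:

```shell
#!/bin/bash
# Build the comma-separated --jars value and the colon-separated classpath
# values from one list, so the options cannot get out of sync.
JARS=(
  fastutil-8.3.1.jar
  boostkit-ml-acc_2.11-2.2.0-spark2.3.2.jar
  boostkit-ml-core_2.11-2.2.0-spark2.3.2.jar
  boostkit-ml-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar
)

# join SEP ITEM...: print the ITEMs joined by the single-character SEP.
join() { local IFS="$1"; shift; echo "$*"; }

SUBMIT_JARS=$(join , "${JARS[@]/#/lib/}")   # lib/ prefix for client-side paths
EXECUTOR_CP=$(join : "${JARS[@]}")          # bare names for the executors

echo "--jars $SUBMIT_JARS"
echo "spark.executor.extraClassPath=$EXECUTOR_CP"
```

The resulting values can then be passed to spark-submit exactly as in the example scripts.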
Table 2 describes the statements in the script.

Table 2 Description of the statements in the script

| Statement | Description |
|---|---|
| spark-submit | Submits jobs in spark-submit mode. |
| --class com.bigdata.ml.RFMain | Entry function of the test program that invokes the algorithms. |
| --driver-class-path "XXX" | Path (absolute path recommended) on the client for storing the files required by the algorithm library: boostkit-ml-acc_2.11-2.2.0-spark2.3.2.jar, boostkit-ml-core_2.11-2.2.0-spark2.3.2.jar, boostkit-ml-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar, fastutil-8.3.1.jar, boostkit-xgboost4j-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar, boostkit-xgboost4j-spark2.3.2_2.11-2.2.0.jar, and boostkit-xgboost4j_2.11-2.2.0.jar.<br>If Spark jobs are submitted in yarn-client mode, specify the directories (including the file names) of the referenced JAR files on the current node, separated by colons (:).<br>If Spark jobs are submitted in yarn-cluster mode, specify only the JAR file names, separated by colons (:). |
| --conf "spark.executor.extraClassPath=XXX" | JAR files required by the machine learning algorithm library, the algorithms, and the dependent third-party open-source library fastutil: boostkit-ml-acc_2.11-2.2.0-spark2.3.2.jar, boostkit-ml-core_2.11-2.2.0-spark2.3.2.jar, boostkit-ml-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar, fastutil-8.3.1.jar, boostkit-xgboost4j-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar, boostkit-xgboost4j-spark2.3.2_2.11-2.2.0.jar, and boostkit-xgboost4j_2.11-2.2.0.jar. |
| --conf "spark.yarn.cluster.driver.extraClassPath=XXX" | JAR files required by the machine learning algorithm library, the algorithms, and the dependent third-party open-source library fastutil (same file list as above).<br>Required only for Spark jobs in yarn-cluster mode. Enter the names of the referenced JAR files, separated by colons (:). |
| --master yarn | Submits Spark tasks to the Yarn cluster. |
| --deploy-mode cluster | Submits Spark tasks in cluster mode. |
| --deploy-mode client | Submits Spark tasks in client mode. |
| --driver-cores | Number of cores used by the driver process. |
| --driver-memory | Memory used by the driver, which cannot exceed the total memory of a single node. |
| --jars | JAR files required by the algorithms. Enter the directories (including the file names) of the JAR files, separated by commas (,). |
| --conf spark.executorEnv.LD_LIBRARY_PATH="XXX" | Sets LD_LIBRARY_PATH for the executors so that libboostkit_xgboost_kernel.so can be loaded. |
| --conf spark.executor.extraLibraryPath="XXX" | Sets the extra library path of the executors to the lib directory so that the executors can load libboostkit_xgboost_kernel.so. |
| --conf spark.driver.extraLibraryPath="XXX" | Sets the extra library path of the driver to the lib directory so that the driver can load libboostkit_xgboost_kernel.so. |
| --files | Copies the specified file to the workspace of the Spark compute nodes when the job runs, so that libboostkit_xgboost_kernel.so can be read. |
| ./ml-test.jar | JAR file that is used as the test program. |
- If you are running the XGBoost algorithm, refer to the following shell script content for task submission (select either yarn-client or yarn-cluster mode).
- Save the shell script for task submission in the /home/test/boostkit/ directory where the test JAR file is stored, and start the Spark job in yarn-client mode. An example of the shell script content is as follows:
  ```shell
  #!/bin/bash
  spark-submit \
    --class com.bigdata.ml.XGBTRunner \
    --master yarn \
    --deploy-mode client \
    --driver-cores 36 \
    --driver-memory 50g \
    --jars "lib/boostkit-xgboost4j-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar,lib/boostkit-xgboost4j-spark2.3.2_2.11-2.2.0.jar,lib/boostkit-xgboost4j_2.11-2.2.0.jar" \
    --conf "spark.executor.extraClassPath=boostkit-xgboost4j-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar:boostkit-xgboost4j-spark2.3.2_2.11-2.2.0.jar:boostkit-xgboost4j_2.11-2.2.0.jar" \
    --driver-class-path "lib/ml-test.jar:lib/boostkit-xgboost4j-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar:lib/boostkit-xgboost4j-spark2.3.2_2.11-2.2.0.jar:lib/boostkit-xgboost4j_2.11-2.2.0.jar" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="./lib/:${LD_LIBRARY_PATH}" \
    --conf spark.executor.extraLibraryPath="./lib" \
    --conf spark.driver.extraLibraryPath="./lib" \
    --files=lib/libboostkit_xgboost_kernel.so \
    ./ml-test.jar
  ```
- Save the shell script for task submission in the /home/test/boostkit/ directory where the test JAR file is stored, and start the Spark job in yarn-cluster mode. An example of the shell script content is as follows:
  ```shell
  #!/bin/bash
  spark-submit \
    --class com.bigdata.ml.XGBTRunner \
    --master yarn \
    --deploy-mode cluster \
    --driver-cores 36 \
    --driver-memory 50g \
    --jars "lib/boostkit-xgboost4j-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar,lib/boostkit-xgboost4j-spark2.3.2_2.11-2.2.0.jar,lib/boostkit-xgboost4j_2.11-2.2.0.jar" \
    --conf "spark.executor.extraClassPath=boostkit-xgboost4j-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar:boostkit-xgboost4j-spark2.3.2_2.11-2.2.0.jar:boostkit-xgboost4j_2.11-2.2.0.jar" \
    --driver-class-path "ml-test.jar:boostkit-xgboost4j-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar:boostkit-xgboost4j-spark2.3.2_2.11-2.2.0.jar:boostkit-xgboost4j_2.11-2.2.0.jar" \
    --conf "spark.yarn.cluster.driver.extraClassPath=ml-test.jar:boostkit-xgboost4j-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar:boostkit-xgboost4j-spark2.3.2_2.11-2.2.0.jar:boostkit-xgboost4j_2.11-2.2.0.jar" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="./lib/:${LD_LIBRARY_PATH}" \
    --conf spark.executor.extraLibraryPath="./lib" \
    --conf spark.driver.extraLibraryPath="./lib" \
    --files=lib/libboostkit_xgboost_kernel.so \
    ./ml-test.jar
  ```
- For details about the script parameters, see Table 2.
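Because a missing native library only surfaces as a runtime failure inside the XGBoost job, a quick existence check before spark-submit can save a failed run. The `require_file` helper below is illustrative only and not part of the BoostKit tooling:

```shell
#!/bin/bash
# require_file PATH: print a confirmation if PATH exists; otherwise report
# the missing file on stderr and fail, so the submit script can abort early.
require_file() {
  if [ -e "$1" ]; then
    echo "found: $1"
  else
    echo "missing: $1" >&2
    return 1
  fi
}

# Intended use at the top of the XGBoost submit scripts:
# require_file lib/libboostkit_xgboost_kernel.so || exit 1
```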
For details about how to upgrade the algorithm library, see Upgrading the Algorithm Library.