Deployment Process
- This section uses version 2.2.0 of the machine learning algorithm library as an example.
- The test tool JAR packages and scripts used in this section are for reference only. You need to develop them based on your specific requirements.
- Install the algorithm packages only on the client. Do not install them on controller or compute nodes.
| Installation Directory | Components to Be Installed |
|---|---|
| /home/test/boostkit/lib | boostkit-ml-acc_2.11-2.2.0-spark2.3.2.jar<br>boostkit-ml-core_2.11-2.2.0-spark2.3.2.jar<br>boostkit-ml-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar<br>boostkit-xgboost4j-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar<br>boostkit-xgboost4j-spark2.3.2_2.11-2.2.0.jar<br>boostkit-xgboost4j_2.11-2.2.0.jar<br>libboostkit_xgboost_kernel.so<br>fastutil-8.3.1.jar (third-party open-source library) |
| /home/test/boostkit/ | JAR test package<br>Shell script for task submission |
Perform the following steps:
- From the client node, log in to the server as an authorized user of the big data component. Install the third-party open-source library fastutil-8.3.1.jar, on which the algorithms depend, in the corresponding directory, and set the permission of the JAR file to 640.
- Go to /home/test/boostkit/lib/.

  ```shell
  cd /home/test/boostkit/lib
  ```

- Download the fastutil-8.3.1.jar file.

  ```shell
  wget https://repo1.maven.org/maven2/it/unimi/dsi/fastutil/8.3.1/fastutil-8.3.1.jar
  ```

- Change the permission on the JAR file.

  ```shell
  chmod 640 fastutil-8.3.1.jar
  ```
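The permission requirement above can be verified before moving on. The following helper is illustrative only and not part of the BoostKit tooling:

```shell
#!/bin/bash
# check_mode FILE MODE: succeed and print a confirmation if FILE's octal
# permission bits equal MODE; otherwise report the mismatch and fail.
check_mode() {
  local file="$1" expected="$2" actual
  actual=$(stat -c '%a' "$file") || return 1
  if [ "$actual" = "$expected" ]; then
    echo "$file: mode $actual (ok)"
  else
    echo "$file: mode $actual, expected $expected" >&2
    return 1
  fi
}

# Intended use after the chmod step:
# check_mode /home/test/boostkit/lib/fastutil-8.3.1.jar 640
```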
- Go to /home/test/boostkit/lib/.
- Copy the library adaptation packages to the /home/test/boostkit/lib/ directory on the client and set the permission for the packages to 550.

  ```shell
  cp /opt/Spark-ml-algo-lib-v2.2.0-spark2.3.2/ml-core/target/boostkit-ml-core_2.11-2.2.0-spark2.3.2.jar /home/test/boostkit/lib
  cp /opt/Spark-ml-algo-lib-v2.2.0-spark2.3.2/ml-accelerator/target/boostkit-ml-acc_2.11-2.2.0-spark2.3.2.jar /home/test/boostkit/lib
  cp /opt/Spark-ml-algo-lib-v2.2.0-spark2.3.2/ml-xgboost/jvm-packages/boostkit-xgboost4j/target/boostkit-xgboost4j_2.11-2.2.0_aarch64.jar /home/test/boostkit/lib
  cp /opt/Spark-ml-algo-lib-v2.2.0-spark2.3.2/ml-xgboost/jvm-packages/boostkit-xgboost4j-spark/target/boostkit-xgboost4j-spark_2.11-2.2.0_aarch64.jar /home/test/boostkit/lib
  chmod 550 /home/test/boostkit/lib/boostkit-*
  chmod 550 /home/test/boostkit/lib/libboostkit_xgboost_kernel.so
  ```
- Save the JAR file of your own algorithm test tool (for example, ml-test.jar) to /home/test/boostkit/ on the client, the parent directory of the algorithm package files.
- If you are running an algorithm other than XGBoost, refer to the following shell script content for task submission (select either yarn-client or yarn-cluster mode).
- Save the shell script for task submission in the /home/test/boostkit/ directory where the test JAR file is stored, and start the Spark job in yarn-client mode. An example of the shell script content is as follows:
  ```shell
  #!/bin/bash
  spark-submit \
    --class com.bigdata.ml.RFMain \
    --master yarn \
    --deploy-mode client \
    --driver-cores 36 \
    --driver-memory 50g \
    --jars "lib/fastutil-8.3.1.jar,lib/boostkit-ml-acc_2.11-2.2.0-spark2.3.2.jar,lib/boostkit-ml-core_2.11-2.2.0-spark2.3.2.jar,lib/boostkit-ml-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar" \
    --conf "spark.executor.extraClassPath=fastutil-8.3.1.jar:boostkit-ml-acc_2.11-2.2.0-spark2.3.2.jar:boostkit-ml-core_2.11-2.2.0-spark2.3.2.jar:boostkit-ml-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar" \
    --driver-class-path "lib/ml-test.jar:lib/fastutil-8.3.1.jar:lib/snakeyaml-1.17.jar:lib/boostkit-ml-acc_2.11-2.2.0-spark2.3.2.jar:lib/boostkit-ml-core_2.11-2.2.0-spark2.3.2.jar:lib/boostkit-ml-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar" \
    ./ml-test.jar
  ```
- Save the shell script for task submission in the /home/test/boostkit/ directory where the test JAR file is stored, and start the Spark job in yarn-cluster mode. An example of the shell script content is as follows:
  ```shell
  #!/bin/bash
  spark-submit \
    --class com.bigdata.ml.RFMain \
    --master yarn \
    --deploy-mode cluster \
    --driver-cores 36 \
    --driver-memory 50g \
    --jars "lib/fastutil-8.3.1.jar,lib/boostkit-ml-acc_2.11-2.2.0-spark2.3.2.jar,lib/boostkit-ml-core_2.11-2.2.0-spark2.3.2.jar,lib/boostkit-ml-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar,lib/boostkit-xgboost4j-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar" \
    --driver-class-path "ml-test.jar:fastutil-8.3.1.jar:snakeyaml-1.17.jar:boostkit-ml-acc_2.11-2.2.0-spark2.3.2.jar:boostkit-ml-core_2.11-2.2.0-spark2.3.2.jar:boostkit-ml-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar" \
    --conf "spark.yarn.cluster.driver.extraClassPath=ml-test.jar:snakeyaml-1.17.jar:boostkit-ml-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar:boostkit-ml-acc_2.11-2.2.0-spark2.3.2.jar:boostkit-ml-core_2.11-2.2.0-spark2.3.2.jar" \
    --conf "spark.executor.extraClassPath=fastutil-8.3.1.jar:boostkit-ml-acc_2.11-2.2.0-spark2.3.2.jar:boostkit-ml-core_2.11-2.2.0-spark2.3.2.jar:boostkit-ml-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar" \
    ./ml-test.jar
  ```
- Because the XGBoost algorithm includes some C++ code, its submission parameters differ slightly from those of other algorithms. The preceding scripts apply only to algorithms other than XGBoost.
- By default, the logs generated during the running of algorithm packages are displayed on the client console and are not stored in files. You can import the customized log4j.properties file to save the logs to your local PC. For details, see Saving Run Logs to a Local PC.
- When submitting tasks in Spark single-node cluster mode, you are advised to enable identity authentication and disable the REST API to avoid Spark vulnerabilities.
- On different big data platforms, the paths specified by executorEnv.LD_LIBRARY_PATH, spark.executor.extraLibraryPath, and spark.driver.extraLibraryPath may be different from those in the example script. Set the parameters based on the actual scenario.
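The scripts above repeat the same JAR list across `--jars`, `--driver-class-path`, and the `extraClassPath` settings. One way to keep these lists from drifting apart is to build them once from a single array. This is a sketch under the same file layout as the examples; the `join` helper is not part of the BoostKit tooling:

```shell
#!/bin/bash
# Build the comma-separated --jars value and the colon-separated classpath
# values from one list, so the options cannot get out of sync.
JARS=(
  fastutil-8.3.1.jar
  boostkit-ml-acc_2.11-2.2.0-spark2.3.2.jar
  boostkit-ml-core_2.11-2.2.0-spark2.3.2.jar
  boostkit-ml-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar
)

# join SEP ITEM...: print the ITEMs joined by the single-character SEP.
join() { local IFS="$1"; shift; echo "$*"; }

SUBMIT_JARS=$(join , "${JARS[@]/#/lib/}")   # lib/ prefix for client-side paths
EXECUTOR_CP=$(join : "${JARS[@]}")          # bare names for the executors

echo "--jars $SUBMIT_JARS"
echo "spark.executor.extraClassPath=$EXECUTOR_CP"
```

The resulting values can then be passed to spark-submit exactly as in the example scripts.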
Table 2 describes the statements in the script.

Table 2 Description of the statements in the script

| Statement | Description |
|---|---|
| spark-submit | Submits jobs in spark-submit mode. |
| --class com.bigdata.ml.RFMain | Entry function of the test program that invokes the algorithms. |
| --driver-class-path "XXX" | Path (absolute path recommended) on the client for storing the files required by the algorithm library: boostkit-ml-acc_2.11-2.2.0-spark2.3.2.jar, boostkit-ml-core_2.11-2.2.0-spark2.3.2.jar, boostkit-ml-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar, fastutil-8.3.1.jar, boostkit-xgboost4j-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar, boostkit-xgboost4j-spark2.3.2_2.11-2.2.0.jar, and boostkit-xgboost4j_2.11-2.2.0.jar.<br>If Spark jobs are submitted in yarn-client mode, specify the directories (including the file names) of the referenced JAR files on the current node, separated by colons (:).<br>If Spark jobs are submitted in yarn-cluster mode, specify only the JAR file names, separated by colons (:). |
| --conf "spark.executor.extraClassPath=XXX" | JAR files required by the machine learning algorithm library, the algorithms, and the dependent third-party open-source library fastutil: boostkit-ml-acc_2.11-2.2.0-spark2.3.2.jar, boostkit-ml-core_2.11-2.2.0-spark2.3.2.jar, boostkit-ml-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar, fastutil-8.3.1.jar, boostkit-xgboost4j-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar, boostkit-xgboost4j-spark2.3.2_2.11-2.2.0.jar, and boostkit-xgboost4j_2.11-2.2.0.jar. |
| --conf "spark.yarn.cluster.driver.extraClassPath=XXX" | JAR files required by the machine learning algorithm library, the algorithms, and the dependent third-party open-source library fastutil (same file list as above).<br>Required only for Spark jobs in yarn-cluster mode. Enter the names of the referenced JAR files, separated by colons (:). |
| --master yarn | Submits Spark tasks to the Yarn cluster. |
| --deploy-mode cluster | Submits Spark tasks in cluster mode. |
| --deploy-mode client | Submits Spark tasks in client mode. |
| --driver-cores | Number of cores used by the driver process. |
| --driver-memory | Memory used by the driver, which cannot exceed the total memory of a single node. |
| --jars | JAR files required by the algorithms. Enter the directories (including the file names) of the JAR files, separated by commas (,). |
| --conf spark.executorEnv.LD_LIBRARY_PATH="XXX" | Sets LD_LIBRARY_PATH for the executors so that libboostkit_xgboost_kernel.so can be loaded. |
| --conf spark.executor.extraLibraryPath="XXX" | Sets the extra library path of the executors to the lib directory so that the executors can load libboostkit_xgboost_kernel.so. |
| --conf spark.driver.extraLibraryPath="XXX" | Sets the extra library path of the driver to the lib directory so that the driver can load libboostkit_xgboost_kernel.so. |
| --files | Copies the specified file to the workspace of the Spark compute nodes when the job runs, so that libboostkit_xgboost_kernel.so can be read. |
| ./ml-test.jar | JAR file that is used as the test program. |
- If you are running the XGBoost algorithm, refer to the following shell script content for task submission (select either yarn-client or yarn-cluster mode).
- Save the shell script for task submission in the /home/test/boostkit/ directory where the test JAR file is stored, and start the Spark job in yarn-client mode. An example of the shell script content is as follows:
  ```shell
  #!/bin/bash
  spark-submit \
    --class com.bigdata.ml.XGBTRunner \
    --master yarn \
    --deploy-mode client \
    --driver-cores 36 \
    --driver-memory 50g \
    --jars "lib/boostkit-xgboost4j-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar,lib/boostkit-xgboost4j-spark2.3.2_2.11-2.2.0.jar,lib/boostkit-xgboost4j_2.11-2.2.0.jar" \
    --conf "spark.executor.extraClassPath=boostkit-xgboost4j-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar:boostkit-xgboost4j-spark2.3.2_2.11-2.2.0.jar:boostkit-xgboost4j_2.11-2.2.0.jar" \
    --driver-class-path "lib/ml-test.jar:lib/boostkit-xgboost4j-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar:lib/boostkit-xgboost4j-spark2.3.2_2.11-2.2.0.jar:lib/boostkit-xgboost4j_2.11-2.2.0.jar" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="./lib/:${LD_LIBRARY_PATH}" \
    --conf spark.executor.extraLibraryPath="./lib" \
    --conf spark.driver.extraLibraryPath="./lib" \
    --files=lib/libboostkit_xgboost_kernel.so \
    ./ml-test.jar
  ```
- Save the shell script for task submission in the /home/test/boostkit/ directory where the test JAR file is stored, and start the Spark job in yarn-cluster mode. An example of the shell script content is as follows:
  ```shell
  #!/bin/bash
  spark-submit \
    --class com.bigdata.ml.XGBTRunner \
    --master yarn \
    --deploy-mode cluster \
    --driver-cores 36 \
    --driver-memory 50g \
    --jars "lib/boostkit-xgboost4j-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar,lib/boostkit-xgboost4j-spark2.3.2_2.11-2.2.0.jar,lib/boostkit-xgboost4j_2.11-2.2.0.jar" \
    --conf "spark.executor.extraClassPath=boostkit-xgboost4j-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar:boostkit-xgboost4j-spark2.3.2_2.11-2.2.0.jar:boostkit-xgboost4j_2.11-2.2.0.jar" \
    --driver-class-path "ml-test.jar:boostkit-xgboost4j-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar:boostkit-xgboost4j-spark2.3.2_2.11-2.2.0.jar:boostkit-xgboost4j_2.11-2.2.0.jar" \
    --conf "spark.yarn.cluster.driver.extraClassPath=ml-test.jar:boostkit-xgboost4j-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar:boostkit-xgboost4j-spark2.3.2_2.11-2.2.0.jar:boostkit-xgboost4j_2.11-2.2.0.jar" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="./lib/:${LD_LIBRARY_PATH}" \
    --conf spark.executor.extraLibraryPath="./lib" \
    --conf spark.driver.extraLibraryPath="./lib" \
    --files=lib/libboostkit_xgboost_kernel.so \
    ./ml-test.jar
  ```
- For details about the script parameters, see Table 2.
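Because a missing native library only surfaces as a runtime failure inside the XGBoost job, a quick existence check before spark-submit can save a failed run. The `require_file` helper below is illustrative only and not part of the BoostKit tooling:

```shell
#!/bin/bash
# require_file PATH: print a confirmation if PATH exists; otherwise report
# the missing file on stderr and fail, so the submit script can abort early.
require_file() {
  if [ -e "$1" ]; then
    echo "found: $1"
  else
    echo "missing: $1" >&2
    return 1
  fi
}

# Intended use at the top of the XGBoost submit scripts:
# require_file lib/libboostkit_xgboost_kernel.so || exit 1
```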
For details about how to upgrade the algorithm library, see Upgrading the Algorithm Library.