Upgrading the Algorithm Library
- From the client node, log in to the server as an authorized user of the big data component. Delete the existing algorithm library JAR packages from the /home/test/boostkit/lib/ directory on the client.
```shell
rm -f /home/test/boostkit/lib/boostkit-*
rm -f /home/test/boostkit/lib/libboostkit-*
```
- Obtain the algorithm library adaptation package following the instructions described in Compiling the Code.
- Obtain the core JAR package of the algorithm library following the instructions described in Obtaining the Core JAR File of the Machine Learning Algorithm Library.
- Copy the algorithm library adaptation package and core JAR package obtained in the preceding steps to /home/test/boostkit/lib/.
```shell
cp /opt/Spark-ml-algo-lib-3.0.0-spark3.3.1/ml-core/target/boostkit-ml-core_2.12-3.0.0-spark3.3.1.jar /home/test/boostkit/lib
cp /opt/Spark-ml-algo-lib-3.0.0-spark3.3.1/ml-accelerator/target/boostkit-ml-acc_2.12-3.0.0-spark3.3.1.jar /home/test/boostkit/lib
cp /opt/boostkit-ml-kernel_2.12-3.0.0-spark3.3.1-aarch64.jar /home/test/boostkit/lib
```
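After copying, it can help to confirm that the upgraded JAR files are actually in place before submitting a job. The following sketch checks for the three copied packages; the directory and file names follow this guide's example, so adjust them to your environment:

```shell
# Example check only: directory and JAR names are taken from this guide's
# example commands; replace them if your installation differs.
lib_dir="/home/test/boostkit/lib"
for jar in boostkit-ml-acc_2.12-3.0.0-spark3.3.1.jar \
           boostkit-ml-core_2.12-3.0.0-spark3.3.1.jar \
           boostkit-ml-kernel_2.12-3.0.0-spark3.3.1-aarch64.jar; do
  if [ -f "${lib_dir}/${jar}" ]; then
    echo "OK       ${jar}"
  else
    echo "MISSING  ${jar}"
  fi
done
```

If any package is reported missing, repeat the copy step before running the submit script.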
- To run the algorithms, write the shell script for the task based on the following examples (select either yarn-client or yarn-cluster mode).
- Save the shell script for task submission in the /home/test/boostkit/ directory where the test JAR file is stored, and start the Spark job in yarn-client mode. An example of the shell script content is as follows:
```shell
#!/bin/bash
spark-submit \
--class com.bigdata.ml.RFMain \
--master yarn \
--deploy-mode client \
--driver-cores 36 \
--driver-memory 50g \
--jars "lib/fastutil-8.3.1.jar,lib/boostkit-ml-acc_2.12-3.0.0-spark3.3.1.jar,lib/boostkit-ml-core_2.12-3.0.0-spark3.3.1.jar,lib/boostkit-ml-kernel-2.12-3.0.0-spark3.3.1-aarch64.jar" \
--conf "spark.executor.extraClassPath=fastutil-8.3.1.jar:boostkit-ml-acc_2.12-3.0.0-spark3.3.1.jar:boostkit-ml-core_2.12-3.0.0-spark3.3.1.jar:boostkit-ml-kernel-2.12-3.0.0-spark3.3.1-aarch64.jar" \
--driver-class-path "lib/ml-test.jar:lib/fastutil-8.3.1.jar:lib/snakeyaml-1.17.jar:lib/boostkit-ml-acc_2.12-3.0.0-spark3.3.1.jar:lib/boostkit-ml-core_2.12-3.0.0-spark3.3.1.jar:lib/boostkit-ml-kernel-2.12-3.0.0-spark3.3.1-aarch64.jar" \
./ml-test.jar
```
- Save the shell script for task submission in the /home/test/boostkit/ directory where the test JAR file is stored, and start the Spark job in yarn-cluster mode. An example of the shell script content is as follows:
```shell
#!/bin/bash
spark-submit \
--class com.bigdata.ml.RFMain \
--master yarn \
--deploy-mode cluster \
--driver-cores 36 \
--driver-memory 50g \
--jars "lib/fastutil-8.3.1.jar,lib/boostkit-ml-acc_2.12-3.0.0-spark3.3.1.jar,lib/boostkit-ml-core_2.12-3.0.0-spark3.3.1.jar,lib/boostkit-ml-kernel-2.12-3.0.0-spark3.3.1-aarch64.jar,lib/boostkit-xgboost4j-kernel-2.12-3.0.0-spark3.3.1-aarch64.jar" \
--driver-class-path "ml-test.jar:fastutil-8.3.1.jar:snakeyaml-1.17.jar:boostkit-ml-acc_2.12-3.0.0-spark3.3.1.jar:boostkit-ml-core_2.12-3.0.0-spark3.3.1.jar:boostkit-ml-kernel-2.12-3.0.0-spark3.3.1-aarch64.jar" \
--conf "spark.yarn.cluster.driver.extraClassPath=ml-test.jar:snakeyaml-1.17.jar:boostkit-ml-kernel-2.12-3.0.0-spark3.3.1-aarch64.jar:boostkit-ml-acc_2.12-3.0.0-spark3.3.1.jar:boostkit-ml-core_2.12-3.0.0-spark3.3.1.jar" \
--conf "spark.executor.extraClassPath=fastutil-8.3.1.jar:boostkit-ml-acc_2.12-3.0.0-spark3.3.1.jar:boostkit-ml-core_2.12-3.0.0-spark3.3.1.jar:boostkit-ml-kernel-2.12-3.0.0-spark3.3.1-aarch64.jar" \
./ml-test.jar
```
- By default, the logs generated during the running of algorithm packages are displayed on the client console and are not stored in files. You can import the customized log4j.properties file to save the logs to your local PC. For details, see Saving Run Logs to a Local PC.
- When submitting tasks in Spark single-node cluster mode, you are advised to enable identity authentication and disable the REST API to avoid Spark vulnerabilities.
- On different big data platforms, the paths specified by spark.executorEnv.LD_LIBRARY_PATH, spark.executor.extraLibraryPath, and spark.driver.extraLibraryPath may differ from those in the example script. Set these parameters based on the actual environment.
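To illustrate the note on library paths, the sketch below assembles the three settings into a variable that can be appended to a spark-submit command. The directory /home/test/boostkit/lib is an assumption taken from this guide's example layout; replace it with the native library location on your platform:

```shell
# Example only: build the library-path options for spark-submit.
# The directory below is this guide's example path; on other big data
# platforms the native libraries may live elsewhere.
native_lib_dir="/home/test/boostkit/lib"
lib_path_opts="--conf spark.executorEnv.LD_LIBRARY_PATH=${native_lib_dir} \
--conf spark.executor.extraLibraryPath=${native_lib_dir} \
--conf spark.driver.extraLibraryPath=${native_lib_dir}"
# Show the options that would be appended to the spark-submit command line.
echo "${lib_path_opts}"
```

Appending `${lib_path_opts}` to the spark-submit invocation in the scripts above keeps the path in one place, so it only needs to be changed once per platform.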
Table 1 describes the statements in the script.

Table 1 Description of the statements in the script

| Statement | Description |
| --- | --- |
| spark-submit | Specifies that jobs are submitted in spark-submit mode. |
| --class com.bigdata.ml.RFMain | Entry class of the test program that invokes the algorithms. |
| --driver-class-path "XXX" | Path (absolute path recommended) on the client for storing the JAR files required by the machine learning algorithm library: boostkit-ml-acc_2.12-3.0.0-spark3.3.1.jar, boostkit-ml-core_2.12-3.0.0-spark3.3.1.jar, boostkit-ml-kernel-2.12-3.0.0-spark3.3.1-aarch64.jar, boostkit-ml-kernel-client_2.12-3.0.0-spark3.3.1.jar, and fastutil-8.3.1.jar. In yarn-client mode, specify the directories (including the file names) of the referenced JAR files on the current node, separated by colons (:). In yarn-cluster mode, specify only the JAR file names, separated by colons (:). |
| --conf "spark.executor.extraClassPath=XXX" | JAR files required by the machine learning algorithm library, the algorithms, and the dependent third-party open source library fastutil: boostkit-ml-acc_2.12-3.0.0-spark3.3.1.jar, boostkit-ml-core_2.12-3.0.0-spark3.3.1.jar, boostkit-ml-kernel-2.12-3.0.0-spark3.3.1-aarch64.jar, boostkit-ml-kernel-client_2.12-3.0.0-spark3.3.1.jar, and fastutil-8.3.1.jar. |
| --conf "spark.yarn.cluster.driver.extraClassPath=XXX" | JAR files required by the machine learning algorithm library, the algorithms, and the dependent third-party open source library fastutil: boostkit-ml-acc_2.12-3.0.0-spark3.3.1.jar, boostkit-ml-core_2.12-3.0.0-spark3.3.1.jar, boostkit-ml-kernel-2.12-3.0.0-spark3.3.1-aarch64.jar, boostkit-ml-kernel-client_2.12-3.0.0-spark3.3.1.jar, and fastutil-8.3.1.jar. Required only for Spark jobs in yarn-cluster mode. Enter the names of the referenced JAR files, separated by colons (:). |
| --master yarn | Specifies that Spark tasks are submitted to the Yarn cluster. |
| --deploy-mode cluster | Submits Spark tasks in cluster mode. |
| --deploy-mode client | Submits Spark tasks in client mode. |
| --driver-cores | Number of cores used by the driver process. |
| --driver-memory | Memory used by the driver, which cannot exceed the total memory of a single node. |
| --jars | JAR files required by the algorithms. Enter the directories (including the file names) of the JAR files, separated by commas (,). |
| --conf spark.executorEnv.LD_LIBRARY_PATH="XXX" | Sets LD_LIBRARY_PATH for the executor. |
| --conf spark.executor.extraLibraryPath="XXX" | Sets the library path for the executor. |
| --conf spark.driver.extraLibraryPath="XXX" | Sets the library path for the driver. |
| --files | Files configured in this parameter are copied to the workspace of the Spark compute nodes when the job runs, so that libboostkit_xgboost_kernel.so and libboostkit_lightgbm_close.so can be read. |
| ./ml-test.jar | JAR file of the test program. |