
Upgrading the Algorithm Library

  1. From the client node, log in to the server as an authorized user of the big data component. Delete the existing algorithm library JAR packages and shared libraries from the /home/test/boostkit/lib/ directory on the client.
    rm -f /home/test/boostkit/lib/boostkit-*
    rm -f /home/test/boostkit/lib/libboostkit-*
    
  2. Obtain the algorithm library adaptation package following the instructions described in Compiling the Code.
  3. Obtain the core JAR package of the algorithm library following the instructions described in Obtaining the Core JAR File of the Machine Learning Algorithm Library.
  4. Copy the algorithm library adaptation packages obtained in 2 and the core JAR package obtained in 3 to /home/test/boostkit/lib/.
    cp /opt/Spark-ml-algo-lib-3.0.0-spark3.3.1/ml-core/target/boostkit-ml-core_2.12-3.0.0-spark3.3.1.jar /home/test/boostkit/lib
    cp /opt/Spark-ml-algo-lib-3.0.0-spark3.3.1/ml-accelerator/target/boostkit-ml-acc_2.12-3.0.0-spark3.3.1.jar /home/test/boostkit/lib
    cp /opt/boostkit-ml-kernel_2.12-3.0.0-spark3.3.1-aarch64.jar /home/test/boostkit/lib
    
  5. To run the algorithms, submit a task using a shell script similar to the following examples (choose either yarn-client or yarn-cluster mode).
    • Save the shell script for task submission in the /home/test/boostkit/ directory where the test JAR file is stored, and start the Spark job in yarn-client mode. An example of the shell script content is as follows:
      #!/bin/bash
      
      spark-submit \
      --class com.bigdata.ml.RFMain \
      --master yarn \
      --deploy-mode client \
      --driver-cores 36 \
      --driver-memory 50g \
      --jars "lib/fastutil-8.3.1.jar,lib/boostkit-ml-acc_2.12-3.0.0-spark3.3.1.jar,lib/boostkit-ml-core_2.12-3.0.0-spark3.3.1.jar,lib/boostkit-ml-kernel-2.12-3.0.0-spark3.3.1-aarch64.jar" \
      --conf "spark.executor.extraClassPath=fastutil-8.3.1.jar:boostkit-ml-acc_2.12-3.0.0-spark3.3.1.jar:boostkit-ml-core_2.12-3.0.0-spark3.3.1.jar:boostkit-ml-kernel-2.12-3.0.0-spark3.3.1-aarch64.jar" \
      --driver-class-path "lib/ml-test.jar:lib/fastutil-8.3.1.jar:lib/snakeyaml-1.17.jar:lib/boostkit-ml-acc_2.12-3.0.0-spark3.3.1.jar:lib/boostkit-ml-core_2.12-3.0.0-spark3.3.1.jar:lib/boostkit-ml-kernel-2.12-3.0.0-spark3.3.1-aarch64.jar" \
      ./ml-test.jar 
      
    • Save the shell script for task submission in the /home/test/boostkit/ directory where the test JAR file is stored, and start the Spark job in yarn-cluster mode. An example of the shell script content is as follows:
      #!/bin/bash
      
      spark-submit \
      --class com.bigdata.ml.RFMain \
      --master yarn \
      --deploy-mode cluster \
      --driver-cores 36 \
      --driver-memory 50g \
      --jars "lib/fastutil-8.3.1.jar,lib/boostkit-ml-acc_2.12-3.0.0-spark3.3.1.jar,lib/boostkit-ml-core_2.12-3.0.0-spark3.3.1.jar,lib/boostkit-ml-kernel-2.12-3.0.0-spark3.3.1-aarch64.jar,lib/boostkit-xgboost4j-kernel-2.12-3.0.0-spark3.3.1-aarch64.jar" \
      --driver-class-path "ml-test.jar:fastutil-8.3.1.jar:snakeyaml-1.17.jar:boostkit-ml-acc_2.12-3.0.0-spark3.3.1.jar:boostkit-ml-core_2.12-3.0.0-spark3.3.1.jar:boostkit-ml-kernel-2.12-3.0.0-spark3.3.1-aarch64.jar" \
      --conf "spark.yarn.cluster.driver.extraClassPath=ml-test.jar:snakeyaml-1.17.jar:boostkit-ml-kernel-2.12-3.0.0-spark3.3.1-aarch64.jar:boostkit-ml-acc_2.12-3.0.0-spark3.3.1.jar:boostkit-ml-core_2.12-3.0.0-spark3.3.1.jar" \
      --conf "spark.executor.extraClassPath=fastutil-8.3.1.jar:boostkit-ml-acc_2.12-3.0.0-spark3.3.1.jar:boostkit-ml-core_2.12-3.0.0-spark3.3.1.jar:boostkit-ml-kernel-2.12-3.0.0-spark3.3.1-aarch64.jar" \
      ./ml-test.jar 
      
      • By default, the logs generated during the running of algorithm packages are displayed on the client console and are not stored in files. You can import the customized log4j.properties file to save the logs to your local PC. For details, see Saving Run Logs to a Local PC.
      • When submitting tasks in Spark single-node cluster mode, you are advised to enable identity authentication and disable the REST API to avoid Spark vulnerabilities.
      • On different big data platforms, the paths specified by spark.executorEnv.LD_LIBRARY_PATH, spark.executor.extraLibraryPath, and spark.driver.extraLibraryPath may differ from those shown here. Set these parameters based on the actual environment.
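
      Before submitting any job, the upgrade performed in steps 1 to 4 can be sanity-checked with a short script. The following is only a sketch: the check_jars helper and the throwaway demo directory are illustrative, not part of the BoostKit tooling. In practice, point the function at /home/test/boostkit/lib/ and the JAR names you actually copied.

```shell
#!/bin/bash
# Sketch: verify that every expected BoostKit JAR is present in a lib directory.
check_jars() {
  local lib_dir="$1"; shift
  local missing=0
  local jar
  for jar in "$@"; do
    if [ ! -f "$lib_dir/$jar" ]; then
      echo "missing: $jar"
      missing=1
    fi
  done
  return "$missing"
}

# Demo against a throwaway directory standing in for /home/test/boostkit/lib/.
demo_dir=$(mktemp -d)
touch "$demo_dir/boostkit-ml-acc_2.12-3.0.0-spark3.3.1.jar" \
      "$demo_dir/boostkit-ml-core_2.12-3.0.0-spark3.3.1.jar" \
      "$demo_dir/boostkit-ml-kernel_2.12-3.0.0-spark3.3.1-aarch64.jar"

if check_jars "$demo_dir" \
    boostkit-ml-acc_2.12-3.0.0-spark3.3.1.jar \
    boostkit-ml-core_2.12-3.0.0-spark3.3.1.jar \
    boostkit-ml-kernel_2.12-3.0.0-spark3.3.1-aarch64.jar; then
  echo "all expected BoostKit JARs present"
fi
rm -rf "$demo_dir"
```

      Running the check immediately after step 4 catches a partial copy before it surfaces later as a ClassNotFoundException in the Spark job.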

      Table 1 describes the statements in the script.

      Table 1 Description of the statements in the script

      spark-submit
          Submits the job in spark-submit mode.

      --class com.bigdata.ml.RFMain
          Entry class of the test program that invokes the algorithms.

      --driver-class-path "XXX"
          Class path on the client (absolute paths recommended). The machine learning algorithm library requires boostkit-ml-acc_2.12-3.0.0-spark3.3.1.jar, boostkit-ml-core_2.12-3.0.0-spark3.3.1.jar, boostkit-ml-kernel-2.12-3.0.0-spark3.3.1-aarch64.jar, boostkit-ml-kernel-client_2.12-3.0.0-spark3.3.1.jar, and fastutil-8.3.1.jar. In yarn-client mode, specify the paths (including file names) of the referenced JAR files on the current node; in yarn-cluster mode, specify only the JAR file names. Separate entries with colons (:).

      --conf "spark.executor.extraClassPath=XXX"
          JAR files required by the machine learning algorithm library, the algorithms, and the dependent third-party open-source library fastutil: boostkit-ml-acc_2.12-3.0.0-spark3.3.1.jar, boostkit-ml-core_2.12-3.0.0-spark3.3.1.jar, boostkit-ml-kernel-2.12-3.0.0-spark3.3.1-aarch64.jar, boostkit-ml-kernel-client_2.12-3.0.0-spark3.3.1.jar, and fastutil-8.3.1.jar.

      --conf "spark.yarn.cluster.driver.extraClassPath=XXX"
          The same JAR files as for spark.executor.extraClassPath. This parameter is required only for Spark jobs in yarn-cluster mode. Enter the names of the referenced JAR files, separated by colons (:).

      --master yarn
          Submits the Spark job to the Yarn cluster.

      --deploy-mode cluster
          Submits the Spark job in cluster mode.

      --deploy-mode client
          Submits the Spark job in client mode.

      --driver-cores
          Number of cores used by the driver process.

      --driver-memory
          Memory used by the driver; cannot exceed the total memory of a single node.

      --jars
          JAR files required by the algorithms. Enter the paths (including file names) of the JAR files, separated by commas (,).

      --conf spark.executorEnv.LD_LIBRARY_PATH="XXX"
          Sets LD_LIBRARY_PATH for the executor.

      --conf spark.executor.extraLibraryPath="XXX"
          Sets the extra library path for the executor.

      --conf spark.driver.extraLibraryPath="XXX"
          Sets the extra library path for the driver.

      --files
          Files copied to the working directory of each Spark compute node when the job runs, so that libboostkit_xgboost_kernel.so and libboostkit_lightgbm_close.so can be read.

      ./ml-test.jar
          JAR file of the test program.
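
      Where the library-path note applies, the corresponding options can be added to either submit script. The fragment below is a sketch only: /home/test/boostkit/lib is assumed as the location of the shared libraries, and the paths must be adjusted to the actual platform.

```shell
spark-submit \
--master yarn \
--deploy-mode client \
--conf spark.executorEnv.LD_LIBRARY_PATH="/home/test/boostkit/lib" \
--conf spark.executor.extraLibraryPath="/home/test/boostkit/lib" \
--conf spark.driver.extraLibraryPath="/home/test/boostkit/lib" \
--files "lib/libboostkit_xgboost_kernel.so" \
./ml-test.jar
```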