
Deployment Process

Move the software packages obtained in Compiling the Spark Algorithm Adaptation Package to the installation directories listed in Table 1. The /home/test/boostkit/ directory is used as the example installation root directory.
  • This section uses version 2.2.0 of the machine learning algorithm library as an example.
  • The test tool JAR packages and scripts used in this section are for reference only. You need to develop them based on your specific requirements.
  • Install the algorithm packages only on the client. Do not install them on controller or compute nodes.
Table 1 Installation directories

Installation Directory: /home/test/boostkit/lib
Components to Be Installed:
  • boostkit-ml-acc_2.11-2.2.0-spark2.3.2.jar
  • boostkit-ml-core_2.11-2.2.0-spark2.3.2.jar
  • boostkit-ml-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar
  • boostkit-xgboost4j-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar
  • boostkit-xgboost4j-spark2.3.2_2.11-2.2.0.jar
  • boostkit-xgboost4j_2.11-2.2.0.jar
  • libboostkit_xgboost_kernel.so
  • fastutil-8.3.1.jar (third-party open-source library)

Installation Directory: /home/test/boostkit/
Components to Be Installed:
  • JAR test package
  • Shell script for task submission

Perform the following steps:

  1. From the client node, log in to the server as an authorized user of the big data component. Install the third-party open-source library fastutil-8.3.1.jar, on which the algorithms depend, to the corresponding directory, and set the permission on the JAR file to 640.
    1. Go to /home/test/boostkit/lib/.
      cd /home/test/boostkit/lib
      
    2. Download the fastutil-8.3.1.jar file.
      wget https://repo1.maven.org/maven2/it/unimi/dsi/fastutil/8.3.1/fastutil-8.3.1.jar
      
    3. Change the permission on the JAR file.
      chmod 640 fastutil-8.3.1.jar
      
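    The download-and-chmod sequence above can be sanity-checked with a small script. A minimal sketch, using a temporary directory in place of /home/test/boostkit/lib so it is safe to run anywhere:

```shell
#!/bin/bash
# Hedged sketch: verify the downloaded JAR ended up with the expected 640
# permission. A temporary directory stands in for /home/test/boostkit/lib.
LIB_DIR="$(mktemp -d)"
touch "${LIB_DIR}/fastutil-8.3.1.jar"
chmod 640 "${LIB_DIR}/fastutil-8.3.1.jar"

# stat -c %a prints the octal permission bits (GNU coreutils).
mode=$(stat -c %a "${LIB_DIR}/fastutil-8.3.1.jar")
echo "fastutil-8.3.1.jar mode: ${mode}"
```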
  2. Copy the library adaptation package to the /home/test/boostkit/lib/ directory on the client and set the permission for the package to 550.
    cp /opt/Spark-ml-algo-lib-v2.2.0-spark2.3.2/ml-core/target/boostkit-ml-core_2.11-2.2.0-spark2.3.2.jar /home/test/boostkit/lib
    cp /opt/Spark-ml-algo-lib-v2.2.0-spark2.3.2/ml-accelerator/target/boostkit-ml-acc_2.11-2.2.0-spark2.3.2.jar /home/test/boostkit/lib
    cp /opt/Spark-ml-algo-lib-v2.2.0-spark2.3.2/ml-xgboost/jvm-packages/boostkit-xgboost4j/target/boostkit-xgboost4j_2.11-2.2.0.jar /home/test/boostkit/lib
    cp /opt/Spark-ml-algo-lib-v2.2.0-spark2.3.2/ml-xgboost/jvm-packages/boostkit-xgboost4j-spark/target/boostkit-xgboost4j-spark2.3.2_2.11-2.2.0.jar /home/test/boostkit/lib
    chmod 550 /home/test/boostkit/lib/boostkit-*
    chmod 550 /home/test/boostkit/lib/libboostkit_xgboost_kernel.so
    
  3. Save the JAR file of your algorithm test tool (for example, ml-test.jar) to /home/test/boostkit/ on the client, the directory one level above the algorithm package files.

    You can develop a test tool yourself or use the kal-test tool (https://gitee.com/kunpengcompute/Spark-ml-algo-lib/tree/master/tools/kal-test).
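
    The layout that steps 1 to 3 produce can be sketched as follows. A temporary directory stands in for /home/test/boostkit/, the file names are the examples from Table 1, and ml-test.jar is a placeholder for your own test JAR:

```shell
#!/bin/bash
# Hedged sketch of the directory layout after steps 1-3.
ROOT="$(mktemp -d)"                                             # stand-in for /home/test/boostkit/
mkdir -p "${ROOT}/lib"
touch "${ROOT}/ml-test.jar"                                     # step 3: test JAR one level above lib/
touch "${ROOT}/lib/fastutil-8.3.1.jar"                          # step 1: third-party dependency
touch "${ROOT}/lib/boostkit-ml-core_2.11-2.2.0-spark2.3.2.jar"  # step 2: one of the adaptation JARs
# ...the remaining adaptation JARs and libboostkit_xgboost_kernel.so also go in lib/

# List the layout relative to the root directory.
layout=$(cd "${ROOT}" && find . -mindepth 1 | sort)
echo "${layout}"
```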

  4. To run an algorithm other than XGBoost, use the following shell script examples for task submission (choose either yarn-client or yarn-cluster mode).
    • Save the shell script for task submission in the /home/test/boostkit/ directory where the test JAR file is stored, and start the Spark job in yarn-client mode. An example of the shell script content is as follows:
      #!/bin/bash
      
      spark-submit \
      --class com.bigdata.ml.RFMain \
      --master yarn \
      --deploy-mode client \
      --driver-cores 36 \
      --driver-memory 50g \
      --jars "lib/fastutil-8.3.1.jar,lib/boostkit-ml-acc_2.11-2.2.0-spark2.3.2.jar,lib/boostkit-ml-core_2.11-2.2.0-spark2.3.2.jar,lib/boostkit-ml-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar" \
      --conf "spark.executor.extraClassPath=fastutil-8.3.1.jar:boostkit-ml-acc_2.11-2.2.0-spark2.3.2.jar:boostkit-ml-core_2.11-2.2.0-spark2.3.2.jar:boostkit-ml-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar" \
      --driver-class-path "lib/ml-test.jar:lib/fastutil-8.3.1.jar:lib/snakeyaml-1.17.jar:lib/boostkit-ml-acc_2.11-2.2.0-spark2.3.2.jar:lib/boostkit-ml-core_2.11-2.2.0-spark2.3.2.jar:lib/boostkit-ml-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar" \
      ./ml-test.jar 
      
    • Save the shell script for task submission in the /home/test/boostkit/ directory where the test JAR file is stored, and start the Spark job in yarn-cluster mode. An example of the shell script content is as follows:
      #!/bin/bash
      
      spark-submit \
      --class com.bigdata.ml.RFMain \
      --master yarn \
      --deploy-mode cluster \
      --driver-cores 36 \
      --driver-memory 50g \
      --jars "lib/fastutil-8.3.1.jar,lib/boostkit-ml-acc_2.11-2.2.0-spark2.3.2.jar,lib/boostkit-ml-core_2.11-2.2.0-spark2.3.2.jar,lib/boostkit-ml-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar,lib/boostkit-xgboost4j-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar" \
      --driver-class-path "ml-test.jar:fastutil-8.3.1.jar:snakeyaml-1.17.jar:boostkit-ml-acc_2.11-2.2.0-spark2.3.2.jar:boostkit-ml-core_2.11-2.2.0-spark2.3.2.jar:boostkit-ml-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar" \
      --conf "spark.yarn.cluster.driver.extraClassPath=ml-test.jar:snakeyaml-1.17.jar:boostkit-ml-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar:boostkit-ml-acc_2.11-2.2.0-spark2.3.2.jar:boostkit-ml-core_2.11-2.2.0-spark2.3.2.jar" \
      --conf "spark.executor.extraClassPath=fastutil-8.3.1.jar:boostkit-ml-acc_2.11-2.2.0-spark2.3.2.jar:boostkit-ml-core_2.11-2.2.0-spark2.3.2.jar:boostkit-ml-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar" \
      ./ml-test.jar 
      
      • The XGBoost algorithm involves some C++ code. Therefore, the parameters for the XGBoost algorithm are slightly different from those for other algorithms. The preceding script can be used to submit jobs of algorithms other than XGBoost.
      • By default, the logs generated during the running of algorithm packages are displayed on the client console and are not stored in files. You can import the customized log4j.properties file to save the logs to your local PC. For details, see Saving Run Logs to a Local PC.
      • When submitting tasks in Spark single-node cluster mode, you are advised to enable identity authentication and disable the REST API to avoid Spark vulnerabilities.
      • On different big data platforms, the paths specified by executorEnv.LD_LIBRARY_PATH, spark.executor.extraLibraryPath, and spark.driver.extraLibraryPath may be different from those in the example script. Set the parameters based on the actual scenario.

      Table 2 describes the statements in the script.

      Table 2 Description of the statements in the script

      Statement

      Description

      spark-submit

      Submits the job using the spark-submit command.

      --class com.bigdata.ml.RFMain

      Entry class of the test program that invokes the algorithms

      --driver-class-path "XXX"

      Path (absolute path recommended) on the client for storing the following files:

      Files required by the algorithm library: boostkit-ml-acc_2.11-2.2.0-spark2.3.2.jar, boostkit-ml-core_2.11-2.2.0-spark2.3.2.jar, boostkit-ml-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar, fastutil-8.3.1.jar, boostkit-xgboost4j-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar, boostkit-xgboost4j-spark2.3.2_2.11-2.2.0.jar, and boostkit-xgboost4j_2.11-2.2.0.jar.

      If Spark jobs are submitted in yarn-client mode, specify the directories (including the file names) of the JAR files to be referenced on the current node. Separate directories with colons (:).

      If Spark jobs are submitted in yarn-cluster mode, specify only the names of the JAR files. Separate JAR file names with colons (:).

      --conf "spark.executor.extraClassPath=XXX"

      JAR files required by the machine learning algorithm library, algorithms, and the dependent third-party open source library fastutil.

      Files required by the algorithm library: boostkit-ml-acc_2.11-2.2.0-spark2.3.2.jar, boostkit-ml-core_2.11-2.2.0-spark2.3.2.jar, boostkit-ml-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar, fastutil-8.3.1.jar, boostkit-xgboost4j-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar, boostkit-xgboost4j-spark2.3.2_2.11-2.2.0.jar, and boostkit-xgboost4j_2.11-2.2.0.jar.

      --conf "spark.yarn.cluster.driver.extraClassPath=XXX"

      JAR files required by the machine learning algorithm library, algorithms, and the dependent third-party open source library fastutil.

      Files required by the algorithm library: boostkit-ml-acc_2.11-2.2.0-spark2.3.2.jar, boostkit-ml-core_2.11-2.2.0-spark2.3.2.jar, boostkit-ml-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar, fastutil-8.3.1.jar, boostkit-xgboost4j-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar, boostkit-xgboost4j-spark2.3.2_2.11-2.2.0.jar, and boostkit-xgboost4j_2.11-2.2.0.jar.

      This parameter needs to be configured only for Spark jobs in yarn-cluster mode. Enter the names of the referenced JAR files. Use colons (:) to separate multiple JAR file names.

      --master yarn

      Specifies that Spark tasks are submitted to the YARN cluster.

      --deploy-mode cluster

      Spark tasks are submitted in cluster mode.

      --deploy-mode client

      Spark tasks are submitted in client mode.

      --driver-cores

      Number of cores used by the driver process

      --driver-memory

      Memory used by the driver, which cannot exceed the total memory of a single node

      --jars

      JAR files required by algorithms. Enter the directories (including the file names) of the JAR files. Separate directories with commas (,).

      --conf spark.executorEnv.LD_LIBRARY_PATH="XXX"

      Sets LD_LIBRARY_PATH for the executors so that libboostkit_xgboost_kernel.so can be loaded from that path.

      --conf spark.executor.extraLibraryPath="XXX"

      Sets the extra library path of the executors to the lib directory so that the executors can load libboostkit_xgboost_kernel.so.

      --conf spark.driver.extraLibraryPath="XXX"

      Sets the extra library path of the driver to the lib directory so that the driver can load libboostkit_xgboost_kernel.so.

      --files

      When a Spark job is executed, the file configured in this parameter is copied to the workspace of the Spark compute node so that the libboostkit_xgboost_kernel.so file can be read.

      ./ml-test.jar

      JAR file that is used as the test program
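
      The long --jars and extraClassPath values in the scripts above follow a mechanical pattern: comma-separated paths for --jars, colon-separated bare file names for extraClassPath. A hedged sketch that derives both from the contents of a lib directory; the temporary directory and the placeholder JAR names are assumptions for illustration:

```shell
#!/bin/bash
# Hedged sketch: build the --jars and spark.executor.extraClassPath values
# from whatever JARs sit in lib/, instead of typing the long lists by hand.
LIB="$(mktemp -d)"                 # stand-in for /home/test/boostkit/lib
touch "${LIB}/a.jar" "${LIB}/b.jar"  # placeholder JAR names

# --jars takes paths separated by commas.
JARS=$(ls "${LIB}"/*.jar | paste -sd, -)
# spark.executor.extraClassPath takes bare file names separated by colons.
CP=$(ls "${LIB}"/*.jar | xargs -n1 basename | paste -sd: -)

echo "--jars ${JARS}"
echo "spark.executor.extraClassPath=${CP}"
```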

  5. To run the XGBoost algorithm, use the following shell script examples for task submission (choose either yarn-client or yarn-cluster mode).
    • Save the shell script for task submission in the /home/test/boostkit/ directory where the test JAR file is stored, and start the Spark job in yarn-client mode. An example of the shell script content is as follows:
      #!/bin/bash
      
      spark-submit \
      --class com.bigdata.ml.XGBTRunner \
      --master yarn \
      --deploy-mode client \
      --driver-cores 36 \
      --driver-memory 50g \
      --jars "lib/boostkit-xgboost4j-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar,lib/boostkit-xgboost4j-spark2.3.2_2.11-2.2.0.jar,lib/boostkit-xgboost4j_2.11-2.2.0.jar" \
      --conf "spark.executor.extraClassPath=boostkit-xgboost4j-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar:boostkit-xgboost4j-spark2.3.2_2.11-2.2.0.jar:boostkit-xgboost4j_2.11-2.2.0.jar" \
      --driver-class-path "lib/ml-test.jar:lib/boostkit-xgboost4j-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar:lib/boostkit-xgboost4j-spark2.3.2_2.11-2.2.0.jar:lib/boostkit-xgboost4j_2.11-2.2.0.jar" \
      --conf spark.executorEnv.LD_LIBRARY_PATH="./lib/:${LD_LIBRARY_PATH}" \
      --conf spark.executor.extraLibraryPath="./lib" \
      --conf spark.driver.extraLibraryPath="./lib" \
      --files=lib/libboostkit_xgboost_kernel.so  \
      ./ml-test.jar 
      
    • Save the shell script for task submission in the /home/test/boostkit/ directory where the test JAR file is stored, and start the Spark job in yarn-cluster mode. An example of the shell script content is as follows:
      #!/bin/bash
      
      spark-submit \
      --class com.bigdata.ml.XGBTRunner \
      --master yarn \
      --deploy-mode cluster \
      --driver-cores 36 \
      --driver-memory 50g \
      --jars "lib/boostkit-xgboost4j-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar,lib/boostkit-xgboost4j-spark2.3.2_2.11-2.2.0.jar,lib/boostkit-xgboost4j_2.11-2.2.0.jar" \
      --conf "spark.executor.extraClassPath=boostkit-xgboost4j-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar:boostkit-xgboost4j-spark2.3.2_2.11-2.2.0.jar:boostkit-xgboost4j_2.11-2.2.0.jar" \
      --driver-class-path "ml-test.jar:boostkit-xgboost4j-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar:boostkit-xgboost4j-spark2.3.2_2.11-2.2.0.jar:boostkit-xgboost4j_2.11-2.2.0.jar" \
      --conf "spark.yarn.cluster.driver.extraClassPath=ml-test.jar:boostkit-xgboost4j-kernel-2.11-2.2.0-spark2.3.2-aarch64.jar:boostkit-xgboost4j-spark2.3.2_2.11-2.2.0.jar:boostkit-xgboost4j_2.11-2.2.0.jar" \
      --conf spark.executorEnv.LD_LIBRARY_PATH="./lib/:${LD_LIBRARY_PATH}" \
      --conf spark.executor.extraLibraryPath="./lib" \
      --conf spark.driver.extraLibraryPath="./lib" \
      --files=lib/libboostkit_xgboost_kernel.so  \
      ./ml-test.jar 
      
      • By default, the logs generated during the running of algorithm packages are displayed on the client console and are not stored in files. You can import the customized log4j.properties file to save the logs to your local PC. For details, see Saving Run Logs to a Local PC.
      • When submitting tasks in Spark single-node cluster mode, you are advised to enable identity authentication and disable the REST API to avoid Spark vulnerabilities.
      • On different big data platforms, the paths specified by executorEnv.LD_LIBRARY_PATH, spark.executor.extraLibraryPath, and spark.driver.extraLibraryPath may be different from those in the example script. Set the parameters based on the actual scenario.
      • For details about the script parameters, see Table 2.
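
      Because the XGBoost scripts rely on the executors finding libboostkit_xgboost_kernel.so via LD_LIBRARY_PATH, a quick local check of the search path can save a failed job. A minimal sketch; the temporary directory and the empty placeholder file stand in for the real library location:

```shell
#!/bin/bash
# Hedged sketch: check whether libboostkit_xgboost_kernel.so would be found
# on a given colon-separated search path, mimicking the loader's lookup.
LIB="$(mktemp -d)"
touch "${LIB}/libboostkit_xgboost_kernel.so"   # placeholder, not the real library
SEARCH_PATH="${LIB}:/usr/lib"                  # assumed search path for the demo

found=""
IFS=':'
for d in ${SEARCH_PATH}; do
  if [ -f "${d}/libboostkit_xgboost_kernel.so" ]; then
    found="${d}"
    break
  fi
done
unset IFS
echo "libboostkit_xgboost_kernel.so found in: ${found:-<not found>}"
```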

For details about how to upgrade the algorithm library, see Upgrading the Algorithm Library.