Building Adaptation Code for the Machine Learning Algorithm Library

  • This section describes how to build Spark-ml-algo-lib, the adaptation code for the machine learning algorithm library. It uses the build that adapts to Spark 2.3.2 as an example. The builds that adapt to Spark 2.4.6 and Spark 3.1.1 are similar.
  • Perform the following operations in a Linux environment. This section is for reference only.
  1. Download the Spark 2.3.2 source code ZIP file to the /opt/ directory and decompress it. The Spark source code directory is generated.

    Download URL: https://github.com/apache/spark/archive/v2.3.2.zip

    wget https://github.com/apache/spark/archive/v2.3.2.zip
    
  2. Download the Breeze 0.13.1 source code ZIP file to the /opt/ directory and decompress it. The Breeze source code directory is generated.

    Download URL: https://github.com/scalanlp/breeze/archive/releases/v0.13.1.zip

    wget https://github.com/scalanlp/breeze/archive/releases/v0.13.1.zip
    
  3. Download the XGBoost 1.1.0 source code ZIP file to the /opt/ directory and decompress it. The XGBoost source code directory is generated.
    Download URL: https://github.com/dmlc/xgboost/archive/refs/tags/v1.1.0.zip
    wget https://github.com/dmlc/xgboost/archive/refs/tags/v1.1.0.zip
    
  4. Obtain the CUB source package and decompress it in the /opt/xgboost-1.1.0 directory to generate the CUB source directory /opt/xgboost-1.1.0/cub-b20808b1b04ec3d6a625e51fbc1eb76f337754ad. Then delete the /opt/xgboost-1.1.0/cub directory and rename /opt/xgboost-1.1.0/cub-b20808b1b04ec3d6a625e51fbc1eb76f337754ad to /opt/xgboost-1.1.0/cub.
    cd /opt/xgboost-1.1.0
    wget -O cub-b20808b1b04ec3d6a625e51fbc1eb76f337754ad.zip https://github.com/NVlabs/cub/archive/b20808b1b04ec3d6a625e51fbc1eb76f337754ad.zip
    unzip cub-b20808b1b04ec3d6a625e51fbc1eb76f337754ad.zip
    rm -rf cub
    mv cub-b20808b1b04ec3d6a625e51fbc1eb76f337754ad cub
    
  5. Obtain the dmlc-core source package and decompress it in the /opt/xgboost-1.1.0 directory to generate the dmlc-core source directory /opt/xgboost-1.1.0/dmlc-core-5df8305fe699d3b503d10c60a231ab0223142407. Then delete the /opt/xgboost-1.1.0/dmlc-core directory and rename /opt/xgboost-1.1.0/dmlc-core-5df8305fe699d3b503d10c60a231ab0223142407 to /opt/xgboost-1.1.0/dmlc-core.
    cd /opt/xgboost-1.1.0
    wget -O dmlc-core-5df8305fe699d3b503d10c60a231ab0223142407.zip https://github.com/dmlc/dmlc-core/archive/5df8305fe699d3b503d10c60a231ab0223142407.zip
    unzip dmlc-core-5df8305fe699d3b503d10c60a231ab0223142407.zip
    rm -rf dmlc-core
    mv dmlc-core-5df8305fe699d3b503d10c60a231ab0223142407 dmlc-core
    
  6. Obtain the Rabit source package and decompress it in the /opt/xgboost-1.1.0 directory to generate the Rabit source directory /opt/xgboost-1.1.0/rabit-4fb34a008db6437c84d1877635064e09a55c8553. Then delete the /opt/xgboost-1.1.0/rabit directory and rename /opt/xgboost-1.1.0/rabit-4fb34a008db6437c84d1877635064e09a55c8553 to /opt/xgboost-1.1.0/rabit.
    cd /opt/xgboost-1.1.0
    wget -O rabit-4fb34a008db6437c84d1877635064e09a55c8553.zip https://github.com/dmlc/rabit/archive/4fb34a008db6437c84d1877635064e09a55c8553.zip
    unzip rabit-4fb34a008db6437c84d1877635064e09a55c8553.zip
    rm -rf rabit
    mv rabit-4fb34a008db6437c84d1877635064e09a55c8553 rabit
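
    Steps 4 to 6 repeat the same delete-and-rename sequence for cub, dmlc-core, and rabit. The sequence could be wrapped in one helper, as in the minimal sketch below. The function name replace_dependency_dir and the XGB_ROOT variable are assumptions made for illustration and are not part of the original procedure. The sketch deletes nothing unless the decompressed directory is already present.

    ```shell
    set -eu

    # Hypothetical helper: replace <root>/<name> with the decompressed
    # directory <root>/<name>-<commit>.
    replace_dependency_dir() {
        local root="$1" name="$2" commit="$3"
        local unpacked="${root}/${name}-${commit}"
        # Refuse to delete anything unless the decompressed directory exists.
        if [ ! -d "${unpacked}" ]; then
            echo "missing ${unpacked}; download and unzip it first" >&2
            return 1
        fi
        rm -rf "${root:?}/${name}"
        mv "${unpacked}" "${root}/${name}"
    }

    # XGB_ROOT is an assumed variable; it defaults to the directory used in this guide.
    XGB_ROOT="${XGB_ROOT:-/opt/xgboost-1.1.0}"
    for pair in \
        "cub b20808b1b04ec3d6a625e51fbc1eb76f337754ad" \
        "dmlc-core 5df8305fe699d3b503d10c60a231ab0223142407" \
        "rabit 4fb34a008db6437c84d1877635064e09a55c8553"; do
        set -- ${pair}   # split "name commit" into $1 and $2
        if [ -d "${XGB_ROOT}/$1-$2" ]; then
            replace_dependency_dir "${XGB_ROOT}" "$1" "$2"
        else
            echo "skip $1: ${XGB_ROOT}/$1-$2 not found"
        fi
    done
    ```

    The loop only acts on archives that have already been decompressed, so it can be rerun safely after a partial download.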
    
  7. Download the Netlib 2.2.1 source code ZIP file to the /opt/ directory and decompress it. The netlib-2.2.1 source code directory is generated.

    Download URL: https://github.com/luhenry/netlib/archive/refs/tags/v2.2.1.zip
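
    Steps 1 to 3 and step 7 share one pattern: download an archive to /opt/ and decompress it there. A minimal sketch of that pattern follows. The helper name fetch_and_unpack, the local ZIP file names, and the DRY_RUN switch are assumptions made for illustration and are not part of the original procedure. Passing -O to wget fixes the local file name, so the unzip call does not depend on how the download is named. By default the sketch only prints the commands; set DRY_RUN=0 to execute them.

    ```shell
    set -eu

    # DRY_RUN=1 (the default in this sketch) prints the commands instead of running them.
    DRY_RUN="${DRY_RUN:-1}"

    # Hypothetical helper: download <url> as <dest_dir>/<zip_name> and
    # decompress it in <dest_dir> (default: /opt).
    fetch_and_unpack() {
        local url="$1" zip_name="$2" dest_dir="${3:-/opt}"
        if [ "${DRY_RUN}" = "1" ]; then
            echo "wget -O ${dest_dir}/${zip_name} ${url}"
            echo "unzip -q -o ${dest_dir}/${zip_name} -d ${dest_dir}"
            return 0
        fi
        wget -O "${dest_dir}/${zip_name}" "${url}"
        unzip -q -o "${dest_dir}/${zip_name}" -d "${dest_dir}"
    }

    fetch_and_unpack https://github.com/apache/spark/archive/v2.3.2.zip spark-2.3.2.zip
    fetch_and_unpack https://github.com/scalanlp/breeze/archive/releases/v0.13.1.zip breeze-0.13.1.zip
    fetch_and_unpack https://github.com/dmlc/xgboost/archive/refs/tags/v1.1.0.zip xgboost-1.1.0.zip
    fetch_and_unpack https://github.com/luhenry/netlib/archive/refs/tags/v2.2.1.zip netlib-2.2.1.zip
    ```

    The XGBoost archive must be decompressed before the CUB, dmlc-core, and Rabit steps, because those steps work inside /opt/xgboost-1.1.0.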

  8. In the /opt/ directory, create a project named Spark-ml-algo-lib with the following directory structure.

    cd /opt/
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/breeze/optimize
    mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/breeze/numerics
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/classification
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/feature
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/optim/aggregator
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/optim/loss
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/recommendation
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/regression
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/stat
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/impl
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/clustering
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/feature
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/fpm
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/linalg/distributed
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/stat/correlation
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/tree
    mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/impl
    mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/clustering
    mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/fpm
    mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/tree/impurity
    mkdir -p Spark-ml-algo-lib/ml-core/src/main/java/dev/ludovic/netlib/blas
    cp -r xgboost-1.1.0 Spark-ml-algo-lib/ml-xgboost
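
    The mkdir commands above differ only in the package path, so the same layout could also be generated from a list. The following is a minimal sketch of that idea. The function name create_layout is an assumption made for illustration. The sketch covers only the directory skeleton and does not copy the XGBoost tree.

    ```shell
    set -eu

    # Hypothetical helper: create the Spark-ml-algo-lib directory skeleton under <root>.
    create_layout() {
        local root="$1" pkg
        local acc="${root}/ml-accelerator/src/main/scala"
        local core="${root}/ml-core/src/main/scala"

        mkdir -p "${acc}/breeze/optimize" "${core}/breeze/numerics"

        # Packages under ml-accelerator.
        for pkg in ml/classification ml/feature ml/optim/aggregator ml/optim/loss \
                   ml/recommendation ml/regression ml/stat ml/tree/impl \
                   mllib/clustering mllib/feature mllib/fpm mllib/linalg/distributed \
                   mllib/stat/correlation mllib/tree; do
            mkdir -p "${acc}/org/apache/spark/${pkg}"
        done

        # Packages under ml-core.
        for pkg in ml/tree/impl mllib/clustering mllib/fpm mllib/tree/impurity; do
            mkdir -p "${core}/org/apache/spark/${pkg}"
        done

        mkdir -p "${root}/ml-core/src/main/java/dev/ludovic/netlib/blas"
    }

    # Usage: create_layout /opt/Spark-ml-algo-lib
    ```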
    
  9. Copy the original files from Spark 2.3.2, Breeze 0.13.1, and Netlib 2.2.1 to the Spark-ml-algo-lib directories according to the mappings in Table 1, Table 2, and Table 3. Delete the unnecessary XGBoost native code listed under "Files or directories to be deleted from the XGBoost native code" below, and copy the remaining code to the Spark-ml-algo-lib/ml-xgboost directory. Rename the directories listed in Table 4: the first column gives the current directory name and the second column gives the new name. Two sample commands for copying files to the destination directories are provided below.

    Some files need to be renamed after being copied to the destination folders.

    Sample commands:
    cp /opt/spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala /opt/Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala
    cp /opt/breeze-releases-v0.13.1/math/src/main/scala/breeze/optimize/FirstOrderMinimizer.scala /opt/Spark-ml-algo-lib/ml-accelerator/src/main/scala/breeze/optimize/FirstOrderMinimizerX.scala
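
    Because the tables in this step list many source and destination pairs, the copy could be driven from a mapping list instead of typing each cp command. The following is a minimal sketch under that assumption. The function names copy_as and copy_from_mapping and the "source|destination" line format are hypothetical and are not part of the original procedure.

    ```shell
    set -eu

    # Hypothetical helper: copy one file, creating the destination directory if needed.
    copy_as() {
        local src="$1" dst="$2"
        if [ ! -f "${src}" ]; then
            echo "source file not found: ${src}" >&2
            return 1
        fi
        mkdir -p "$(dirname "${dst}")"
        cp "${src}" "${dst}"
    }

    # Hypothetical helper: read "source|destination" pairs from standard input.
    # Sources are relative to <src_root>, destinations to <dst_root>.
    copy_from_mapping() {
        local src_root="$1" dst_root="$2" src dst
        while IFS='|' read -r src dst; do
            [ -n "${src}" ] || continue   # skip blank lines
            copy_as "${src_root}/${src}" "${dst_root}/${dst}"
        done
    }

    # Usage (the Breeze sample command from this step, written as a mapping line):
    #   printf '%s\n' \
    #     "breeze-releases-v0.13.1/math/src/main/scala/breeze/optimize/FirstOrderMinimizer.scala|ml-accelerator/src/main/scala/breeze/optimize/FirstOrderMinimizerX.scala" \
    #     | copy_from_mapping /opt /opt/Spark-ml-algo-lib
    ```

    A renamed file is expressed simply by giving a different file name on the destination side of the pair.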
    
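    Table 1 Spark files required in the Spark-ml-algo-lib project

    | Directory in the Spark-ml-algo-lib Project | File Name in the Spark-ml-algo-lib Project | Original Directory in Spark | Original File Name in Spark |
    |---|---|---|---|
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/classification/ | GBTClassifier.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/classification/ | GBTClassifier.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/classification/ | LinearSVC.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/classification/ | LinearSVC.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/classification/ | RandomForestClassifier.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/classification/ | RandomForestClassifier.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/classification/ | DecisionTreeClassifier.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/classification/ | DecisionTreeClassifier.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/classification/ | LogisticRegression.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/classification/ | LogisticRegression.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/feature/ | IDF.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/feature/ | IDF.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/feature/ | Word2Vec.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/feature/ | Word2Vec.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/feature/ | DecisionTreeBucketizer.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/classification/ | RandomForestClassifier.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/optim/aggregator/ | DifferentiableLossAggregatorX.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/ | DifferentiableLossAggregator.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/optim/aggregator/ | HingeAggregatorX.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/ | HingeAggregator.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/optim/aggregator/ | HuberAggregatorX.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/ | HuberAggregator.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/optim/aggregator/ | LeastSquaresAggregatorX.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/ | LeastSquaresAggregator.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/optim/aggregator/ | LogisticAggregatorX.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/ | LogisticAggregator.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/optim/loss/ | RDDLossFunctionX.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/optim/loss/ | RDDLossFunction.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/recommendation/ | ALS.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/recommendation/ | ALS.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/regression/ | DecisionTreeRegressor.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/regression/ | DecisionTreeRegressor.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/regression/ | GBTRegressor.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/regression/ | GBTRegressor.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/regression/ | LinearRegression.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/regression/ | LinearRegression.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/regression/ | RandomForestRegressor.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/regression/ | RandomForestRegressor.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/stat/ | Correlation.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/stat/ | Correlation.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/impl/ | GradientBoostedTrees.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | GradientBoostedTrees.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/impl/ | NodeIdCache.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | NodeIdCache.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/impl/ | RandomForest.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | RandomForest.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/impl/ | RandomForest4GBDTX.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | RandomForest.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/impl/ | RandomForestRaw.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | RandomForest.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/impl/ | DecisionForest.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | RandomForest.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/impl/ | DecisionTreeBucket.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | RandomForest.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/impl/ | DecisionTreeMetadata.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | DecisionTreeMetadata.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/ | treeParams.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/ | treeParams.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/ | treeModels.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/ | treeModels.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/clustering/ | KMACCm.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/clustering/ | KMeans.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/clustering/ | KMeans.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/clustering/ | KMeans.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/clustering/ | LDA.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/clustering/ | LDA.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/clustering/ | LDAOptimizer.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/clustering/ | LDAOptimizer.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/feature/ | IDF.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/feature/ | IDF.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/feature/ | Word2Vec.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/feature/ | Word2Vec.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/fpm/ | PrefixSpan.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/fpm/ | PrefixSpan.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/linalg/distributed/ | RowMatrix.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/ | RowMatrix.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/linalg/ | EigenValueDecomposition.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/linalg/ | EigenValueDecomposition.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/stat/correlation/ | Correlation.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/ | Correlation.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/stat/correlation/ | PearsonCorrelation.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/ | PearsonCorrelation.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/stat/correlation/ | SpearmanCorrelation.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/ | SpearmanCorrelation.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/tree/ | DecisionTree.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/tree/ | DecisionTree.scala |
    | Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/ | Node.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/ | Node.scala |
    | Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/ | Split.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/ | Split.scala |
    | Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/impl/ | BaggedPoint.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | BaggedPoint.scala |
    | Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/impl/ | DTFeatureStatsAggregator.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | DTStatsAggregator.scala |
    | Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/impl/ | DTStatsAggregator.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | DTStatsAggregator.scala |
    | Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/impl/ | GradientBoostedTreesCore.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | RandomForest.scala |
    | Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/impl/ | TreePointX.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | TreePoint.scala |
    | Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/impl/ | TreePointY.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | TreePoint.scala |
    | Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/clustering/ | LDAUtilsX.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/clustering/ | LDAUtils.scala |
    | Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/clustering/ | OnlineLDAOptimizerXObj.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/clustering/ | LDAOptimizer.scala |
    | Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/fpm/ | LocalPrefixSpan.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/fpm/ | LocalPrefixSpan.scala |
    | Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/fpm/ | PrefixSpanBase.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/fpm/ | PrefixSpan.scala |
    | Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/tree/impurity/ | Entropy.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/ | Entropy.scala |
    | Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/tree/impurity/ | Gini.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/ | Gini.scala |
    | Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/tree/impurity/ | Impurities.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/ | Impurities.scala |
    | Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/tree/impurity/ | Impurity.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/ | Impurity.scala |
    | Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/tree/impurity/ | Variance.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/ | Variance.scala |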

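    Table 2 Breeze files required in the Spark-ml-algo-lib project

    | Directory in the Spark-ml-algo-lib Project | File Name in the Spark-ml-algo-lib Project | Original Directory in Breeze | Original File Name in Breeze |
    |---|---|---|---|
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/breeze/optimize/ | FirstOrderMinimizerX.scala | breeze-releases-v0.13.1/math/src/main/scala/breeze/optimize/ | FirstOrderMinimizer.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/breeze/optimize/ | LBFGSX.scala | breeze-releases-v0.13.1/math/src/main/scala/breeze/optimize/ | LBFGS.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/breeze/optimize/ | OWLQNX.scala | breeze-releases-v0.13.1/math/src/main/scala/breeze/optimize/ | OWLQN.scala |
    | Spark-ml-algo-lib/ml-core/src/main/scala/breeze/numerics/ | DigammaX.scala | breeze-releases-v0.13.1/math/src/main/scala/breeze/numerics/ | package.scala |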

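    Table 3 Netlib files required in the Spark-ml-algo-lib project

    | Directory in the Spark-ml-algo-lib Project | File Name in the Spark-ml-algo-lib Project | Original Directory in Netlib | Original File Name in Netlib |
    |---|---|---|---|
    | Spark-ml-algo-lib/ml-core/src/main/java/dev/ludovic/netlib/ | BLAS.java | netlib-2.2.1/blas/src/main/java/dev/ludovic/netlib/ | BLAS.java |
    | Spark-ml-algo-lib/ml-core/src/main/java/dev/ludovic/netlib/ | InstanceBuilder.java | netlib-2.2.1/blas/src/main/java/dev/ludovic/netlib/ | InstanceBuilder.java |
    | Spark-ml-algo-lib/ml-core/src/main/java/dev/ludovic/netlib/ | JavaBLAS.java | netlib-2.2.1/blas/src/main/java/dev/ludovic/netlib/ | JavaBLAS.java |
    | Spark-ml-algo-lib/ml-core/src/main/java/dev/ludovic/netlib/ | NativeBLAS.java | netlib-2.2.1/blas/src/main/java/dev/ludovic/netlib/ | NativeBLAS.java |
    | Spark-ml-algo-lib/ml-core/src/main/java/dev/ludovic/netlib/blas/ | AbstractBLAS.java | netlib-2.2.1/blas/src/main/java/dev/ludovic/netlib/blas/ | AbstractBLAS.java |
    | Spark-ml-algo-lib/ml-core/src/main/java/dev/ludovic/netlib/blas/ | F2jBLAS.java | netlib-2.2.1/blas/src/main/java/dev/ludovic/netlib/blas/ | F2jBLAS.java |
    | Spark-ml-algo-lib/ml-core/src/main/java/dev/ludovic/netlib/blas/ | JNIBLAS.java | netlib-2.2.1/blas/src/main/java/dev/ludovic/netlib/blas/ | JNIBLAS.java |
    | Spark-ml-algo-lib/ml-core/src/main/java/dev/ludovic/netlib/blas/ | Java8BLAS.java | netlib-2.2.1/blas/src/main/java/dev/ludovic/netlib/blas/ | Java8BLAS.java |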

    Table 4 Directories whose names need to be changed

    | Directory in the Spark-ml-algo-lib Project | New Name of the Directory |
    |---|---|
    | Spark-ml-algo-lib/ml-xgboost/jvm-packages/xgboost4j | Spark-ml-algo-lib/ml-xgboost/jvm-packages/boostkit-xgboost4j |
    | Spark-ml-algo-lib/ml-xgboost/jvm-packages/xgboost4j-example | Spark-ml-algo-lib/ml-xgboost/jvm-packages/boostkit-xgboost4j-example |
    | Spark-ml-algo-lib/ml-xgboost/jvm-packages/xgboost4j-flink | Spark-ml-algo-lib/ml-xgboost/jvm-packages/boostkit-xgboost4j-flink |
    | Spark-ml-algo-lib/ml-xgboost/jvm-packages/xgboost4j-spark | Spark-ml-algo-lib/ml-xgboost/jvm-packages/boostkit-xgboost4j-spark |
    | Spark-ml-algo-lib/ml-xgboost/jvm-packages/xgboost4j-tester | Spark-ml-algo-lib/ml-xgboost/jvm-packages/boostkit-xgboost4j-tester |
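
    Every rename in Table 4 adds the boostkit- prefix to a module directory under jvm-packages, so the renames could be done in one loop. The following is a minimal sketch. The function name rename_xgboost4j_dirs is an assumption made for illustration and is not part of the original procedure.

    ```shell
    set -eu

    # Hypothetical helper: add the boostkit- prefix to the xgboost4j module
    # directories found under <jvm_root>.
    rename_xgboost4j_dirs() {
        local jvm_root="$1" module
        for module in xgboost4j xgboost4j-example xgboost4j-flink \
                      xgboost4j-spark xgboost4j-tester; do
            if [ -d "${jvm_root}/${module}" ]; then
                mv "${jvm_root}/${module}" "${jvm_root}/boostkit-${module}"
            else
                echo "skip: ${jvm_root}/${module} not found" >&2
            fi
        done
    }

    # Usage: rename_xgboost4j_dirs /opt/Spark-ml-algo-lib/ml-xgboost/jvm-packages
    ```

    Directories that are missing or already renamed are reported and skipped, so the helper can be run more than once.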

    Files or directories to be deleted from the XGBoost native code:

    • xgboost-1.1.0/.github
    • xgboost-1.1.0/cub/.settings
    • xgboost-1.1.0/cub/.project
    • xgboost-1.1.0/dmlc-core/.github
    • xgboost-1.1.0/dmlc-core/make/config.mk
    • xgboost-1.1.0/dmlc-core/test/unittest/sample.rec
    • xgboost-1.1.0/doc/_static
    • xgboost-1.1.0/rabit/lib
    • xgboost-1.1.0/R-package/data
    • xgboost-1.1.0/.gitignore
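
    The deletions listed above could be scripted as in the minimal sketch below. The function name prune_xgboost_tree is an assumption made for illustration. The root directory is passed as an argument, so the same helper can be pointed at whichever copy of the XGBoost tree is being cleaned.

    ```shell
    set -eu

    # Hypothetical helper: delete the files and directories named in the list
    # above, relative to <xgb_root>.
    prune_xgboost_tree() {
        local xgb_root="$1" rel
        for rel in .github cub/.settings cub/.project dmlc-core/.github \
                   dmlc-core/make/config.mk dmlc-core/test/unittest/sample.rec \
                   doc/_static rabit/lib R-package/data .gitignore; do
            # ${xgb_root:?} aborts if the root is empty, so "/" is never targeted.
            rm -rf "${xgb_root:?}/${rel}"
        done
    }

    # Usage: prune_xgboost_tree /opt/xgboost-1.1.0
    ```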
  10. Download the patch to the /opt/Spark-ml-algo-lib/ directory and apply it. The following uses the Spark 2.3.2 patch as an example. Applying the patch to Spark-ml-algo-lib produces the complete adaptation code for the machine learning algorithm library.
    cd /opt/Spark-ml-algo-lib
    wget https://github.com/kunpengcompute/Spark-ml-algo-lib/releases/download/v2.2.0-spark2.3.2/Spark-ml-algo-lib-Spark2.3.2.patch
    patch -p1 < Spark-ml-algo-lib-Spark2.3.2.patch
    

    The directory structure of the complete adaptation code Spark-ml-algo-lib of the machine learning algorithm library is the same as that in the repository.
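
    If a file was copied to the wrong place or under the wrong name in step 9, the patch in step 10 may apply only partially. One way to guard against that is to test the patch first, as in the minimal sketch below. The function name apply_patch_checked is an assumption made for illustration, and the sketch assumes GNU patch, whose --dry-run option checks a patch without modifying any file.

    ```shell
    set -eu

    # Hypothetical helper: apply <patch_file> inside <project_dir> only if a
    # dry run succeeds. <patch_file> should be an absolute path.
    apply_patch_checked() {
        local project_dir="$1" patch_file="$2"
        (
            cd "${project_dir}" || exit 1
            # --dry-run reports problems without modifying any file.
            if ! patch -p1 --dry-run < "${patch_file}" > /dev/null; then
                echo "patch does not apply cleanly; nothing was changed" >&2
                exit 1
            fi
            patch -p1 < "${patch_file}"
        )
    }

    # Usage:
    #   apply_patch_checked /opt/Spark-ml-algo-lib \
    #       /opt/Spark-ml-algo-lib/Spark-ml-algo-lib-Spark2.3.2.patch
    ```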