
Building Adaptation Code for the Spark Machine Learning Algorithm Library

Building the adaptation code Spark-ml-algo-lib

  1. Download the Spark 2.3.2 source code ZIP file to the /opt/ directory and decompress it. The Spark source code directory /opt/spark-2.3.2 is generated.

    Download link: https://github.com/apache/spark/archive/v2.3.2.zip

    cd /opt/
    wget https://github.com/apache/spark/archive/v2.3.2.zip
    unzip v2.3.2.zip
  2. Download the Breeze 0.13.1 source code ZIP file to the /opt/ directory and decompress it. The Breeze source code directory /opt/breeze-releases-v0.13.1 is generated.

    Download link: https://github.com/scalanlp/breeze/archive/releases/v0.13.1.zip

    cd /opt/
    wget https://github.com/scalanlp/breeze/archive/releases/v0.13.1.zip
    unzip v0.13.1.zip
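Before continuing, it can help to confirm that both archives unpacked to the directories that the later copy commands read from. The following is a minimal sketch and not part of the original procedure: the function name check_sources is hypothetical, and the two subdirectories it checks are taken from the source paths used in the sample copy commands of step 4.

```shell
# check_sources [BASE_DIR]: verify that the unpacked Spark and Breeze trees exist.
# BASE_DIR defaults to /opt, the directory used throughout this guide.
# The function name is illustrative only.
check_sources() {
    local base_dir="${1:-/opt}"
    local missing=0
    local dir
    for dir in \
        "$base_dir/spark-2.3.2/mllib/src/main/scala/org/apache/spark" \
        "$base_dir/breeze-releases-v0.13.1/math/src/main/scala/breeze/optimize"
    do
        if [ ! -d "$dir" ]; then
            # Report every missing directory instead of stopping at the first one.
            echo "missing: $dir" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage: check_sources /opt && echo "source trees are in place"
```

The function returns a non-zero status if either directory is absent, so it can gate the remaining steps in a script.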
  3. In the /opt/ directory, create a project named Spark-ml-algo-lib with the following directory structure.

    cd /opt/
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/breeze/optimize
    mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/breeze/numerics
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/classification
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/optim/aggregator
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/optim/loss
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/recommendation
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/regression
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/impl
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/clustering
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/fpm
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/linalg/distributed
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/tree
    mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/impl
    mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/clustering
    mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/fpm
    mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/tree/impurity
  4. Copy the original files from Spark 2.3.2 and Breeze 0.13.1 to the Spark-ml-algo-lib directories according to the mappings in Table 1 and Table 2.

    Some files must be renamed when they are copied to the destination directories, as in the second sample command.

    Sample commands:
    cp /opt/spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala /opt/Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala
    cp /opt/breeze-releases-v0.13.1/math/src/main/scala/breeze/optimize/FirstOrderMinimizer.scala /opt/Spark-ml-algo-lib/ml-accelerator/src/main/scala/breeze/optimize/FirstOrderMinimizerX.scala
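Because Table 1 and Table 2 list more than forty files, some of them renamed, a small wrapper around cp can catch a mistyped source path early. This is a sketch under stated assumptions rather than part of the original procedure: the function name copy_src and its error message are hypothetical.

```shell
# copy_src SRC DEST: copy one source file to its (possibly renamed) destination.
# Fails with a message if SRC does not exist, and creates the destination
# directory if it is missing. The function name is illustrative only.
copy_src() {
    local src="$1"
    local dest="$2"
    if [ ! -f "$src" ]; then
        echo "source file not found: $src" >&2
        return 1
    fi
    mkdir -p "$(dirname "$dest")" && cp "$src" "$dest"
}

# Example of a renamed copy (FirstOrderMinimizer.scala -> FirstOrderMinimizerX.scala):
# copy_src /opt/breeze-releases-v0.13.1/math/src/main/scala/breeze/optimize/FirstOrderMinimizer.scala \
#          /opt/Spark-ml-algo-lib/ml-accelerator/src/main/scala/breeze/optimize/FirstOrderMinimizerX.scala
```

Passing the full destination file name, not just the directory, keeps each rename explicit in the command itself.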
    
    Table 1 Spark files required in the Spark-ml-algo-lib project

    Each group gives the directory in the Spark-ml-algo-lib project and the original directory in Spark, followed by one row per file: the file name in the project (left) and the original file name in Spark (right).

    Directory in the Spark-ml-algo-lib project: Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/classification/
    Original directory in Spark: spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/classification/
        GBTClassifier.scala                   GBTClassifier.scala
        LinearSVC.scala                       LinearSVC.scala
        RandomForestClassifier.scala          RandomForestClassifier.scala
        DecisionTreeClassifier.scala          DecisionTreeClassifier.scala
        LogisticRegression.scala              LogisticRegression.scala

    Directory in the Spark-ml-algo-lib project: Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/optim/aggregator/
    Original directory in Spark: spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/
        DifferentiableLossAggregatorX.scala   DifferentiableLossAggregator.scala
        HingeAggregatorX.scala                HingeAggregator.scala
        HuberAggregatorX.scala                HuberAggregator.scala
        LeastSquaresAggregatorX.scala         LeastSquaresAggregator.scala
        LogisticAggregatorX.scala             LogisticAggregator.scala

    Directory in the Spark-ml-algo-lib project: Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/optim/loss/
    Original directory in Spark: spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/optim/loss/
        RDDLossFunctionX.scala                RDDLossFunction.scala

    Directory in the Spark-ml-algo-lib project: Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/regression/
    Original directory in Spark: spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/regression/
        DecisionTreeRegressor.scala           DecisionTreeRegressor.scala
        GBTRegressor.scala                    GBTRegressor.scala
        LinearRegression.scala                LinearRegression.scala
        RandomForestRegressor.scala           RandomForestRegressor.scala

    Directory in the Spark-ml-algo-lib project: Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/impl/
    Original directory in Spark: spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/
        GradientBoostedTrees.scala            GradientBoostedTrees.scala
        NodeIdCache.scala                     NodeIdCache.scala
        RandomForest.scala                    RandomForest.scala
        RandomForest4GBDTX.scala              RandomForest.scala
        RandomForestRaw.scala                 RandomForest.scala
        DecisionForest.scala                  RandomForest.scala

    Directory in the Spark-ml-algo-lib project: Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/
    Original directory in Spark: spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/
        treeParams.scala                      treeParams.scala

    Directory in the Spark-ml-algo-lib project: Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/clustering/
    Original directory in Spark: spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/clustering/
        KMACCm.scala                          KMeans.scala
        KMeans.scala                          KMeans.scala

    Directory in the Spark-ml-algo-lib project: Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/linalg/distributed/
    Original directory in Spark: spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/
        RowMatrix.scala                       RowMatrix.scala

    Directory in the Spark-ml-algo-lib project: Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/linalg/
    Original directory in Spark: spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/linalg/
        EigenValueDecomposition.scala         EigenValueDecomposition.scala

    Directory in the Spark-ml-algo-lib project: Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/tree/
    Original directory in Spark: spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/tree/
        DecisionTree.scala                    DecisionTree.scala

    Directory in the Spark-ml-algo-lib project: Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/
    Original directory in Spark: spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/
        Node.scala                            Node.scala
        Split.scala                           Split.scala

    Directory in the Spark-ml-algo-lib project: Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/impl/
    Original directory in Spark: spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/
        BaggedPoint.scala                     BaggedPoint.scala
        DTFeatureStatsAggregator.scala        DTStatsAggregator.scala
        DTStatsAggregator.scala               DTStatsAggregator.scala
        GradientBoostedTreesCore.scala        RandomForest.scala
        TreePointX.scala                      TreePoint.scala
        TreePointY.scala                      TreePoint.scala

    Directory in the Spark-ml-algo-lib project: Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/tree/impurity/
    Original directory in Spark: spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/
        Entropy.scala                         Entropy.scala
        Gini.scala                            Gini.scala
        Impurities.scala                      Impurities.scala
        Impurity.scala                        Impurity.scala
        Variance.scala                        Variance.scala

    Table 2 Breeze files required in the Spark-ml-algo-lib project

    Each row gives the file name in the Spark-ml-algo-lib project (left) and the original file name in Breeze (right).

    Directory in the Spark-ml-algo-lib project: Spark-ml-algo-lib/ml-accelerator/src/main/scala/breeze/optimize/
    Original directory in Breeze: breeze-releases-v0.13.1/math/src/main/scala/breeze/optimize/
        FirstOrderMinimizerX.scala            FirstOrderMinimizer.scala
        LBFGSX.scala                          LBFGS.scala
        OWLQNX.scala                          OWLQN.scala

    After step 4 is complete, the Spark-ml-algo-lib project has the following directory structure and files:

    Spark-ml-algo-lib
    ├── ml-accelerator
    │   └── src
    │       └── main
    │           └── scala
    │               ├── breeze
    │               │   └── optimize
    │               │       ├── FirstOrderMinimizerX.scala
    │               │       ├── LBFGSX.scala
    │               │       └── OWLQNX.scala
    │               └── org
    │                   └── apache
    │                       └── spark
    │                           ├── ml
    │                           │   ├── classification
    │                           │   │   ├── DecisionTreeClassifier.scala
    │                           │   │   ├── GBTClassifier.scala
    │                           │   │   ├── LinearSVC.scala
    │                           │   │   ├── LogisticRegression.scala
    │                           │   │   └── RandomForestClassifier.scala
    │                           │   ├── optim
    │                           │   │   ├── aggregator
    │                           │   │   │   ├── DifferentiableLossAggregatorX.scala
    │                           │   │   │   ├── HingeAggregatorX.scala
    │                           │   │   │   ├── HuberAggregatorX.scala
    │                           │   │   │   ├── LeastSquaresAggregatorX.scala
    │                           │   │   │   └── LogisticAggregatorX.scala
    │                           │   │   └── loss
    │                           │   │       └── RDDLossFunctionX.scala
    │                           │   ├── regression
    │                           │   │   ├── DecisionTreeRegressor.scala
    │                           │   │   ├── GBTRegressor.scala
    │                           │   │   ├── LinearRegression.scala
    │                           │   │   └── RandomForestRegressor.scala
    │                           │   └── tree
    │                           │       ├── impl
    │                           │       │   ├── DecisionForest.scala
    │                           │       │   ├── GradientBoostedTrees.scala
    │                           │       │   ├── NodeIdCache.scala
    │                           │       │   ├── RandomForest4GBDTX.scala
    │                           │       │   ├── RandomForestRaw.scala
    │                           │       │   └── RandomForest.scala
    │                           │       └── treeParams.scala
    │                           └── mllib
    │                               ├── clustering
    │                               │   ├── KMACCm.scala
    │                               │   └── KMeans.scala
    │                               ├── linalg
    │                               │   ├── distributed
    │                               │   │   └── RowMatrix.scala
    │                               │   └── EigenValueDecomposition.scala
    │                               └── tree
    │                                   └── DecisionTree.scala
    └── ml-core
        └── src
            └── main
                └── scala
                    └── org
                        └── apache
                            └── spark
                                ├── ml
                                │   └── tree
                                │       ├── impl
                                │       │   ├── BaggedPoint.scala
                                │       │   ├── DTFeatureStatsAggregator.scala
                                │       │   ├── DTStatsAggregator.scala
                                │       │   ├── GradientBoostedTreesCore.scala
                                │       │   ├── TreePointX.scala
                                │       │   └── TreePointY.scala
                                │       ├── Node.scala
                                │       └── Split.scala
                                └── mllib
                                    └── tree
                                        └── impurity
                                            ├── Entropy.scala
                                            ├── Gini.scala
                                            ├── Impurities.scala
                                            ├── Impurity.scala
                                            └── Variance.scala
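A quick way to check the result of step 4 is to count the copied files: Table 1 lists 40 files and Table 2 lists 3, so the project should contain 43 .scala files before the patch is applied. The following sketch assumes nothing else has been placed in the project directory yet; the function names count_scala_files and check_layout are hypothetical.

```shell
# count_scala_files ROOT: print the number of .scala files below ROOT.
count_scala_files() {
    find "$1" -type f -name '*.scala' | wc -l | tr -d ' '
}

# check_layout ROOT: succeed only if ROOT holds exactly the 43 files
# listed in Table 1 (40 files) and Table 2 (3 files).
# Both function names are illustrative only.
check_layout() {
    local expected=43
    local actual
    actual=$(count_scala_files "$1")
    if [ "$actual" -ne "$expected" ]; then
        echo "expected $expected .scala files, found $actual" >&2
        return 1
    fi
}

# Usage: check_layout /opt/Spark-ml-algo-lib && echo "layout matches Table 1 and Table 2"
```

A matching count does not prove that every file is in the right directory, so a mismatch is the more informative outcome: it points to a missed or duplicated copy.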
  5. Download Spark-ml-algo-lib.patch to the /opt/Spark-ml-algo-lib/ directory and apply it to the project to obtain the complete Spark-ml-algo-lib adaptation code of the machine learning algorithm library.
    cd /opt/Spark-ml-algo-lib
    wget https://github.com/kunpengcompute/Spark-ml-algo-lib/releases/download/v1.1.0/Spark-ml-algo-lib.patch
    patch -p1 < Spark-ml-algo-lib.patch
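If a file from step 4 is missing or was copied from a different source version, the patch may apply only partially. One cautious approach, sketched here and not part of the original procedure, is to run patch with --dry-run first and apply the patch only if that succeeds. The function name apply_patch_checked is hypothetical, and the sketch assumes a patch implementation that supports --dry-run, as GNU patch does.

```shell
# apply_patch_checked DIR PATCH: apply PATCH inside DIR only if a dry run succeeds.
# PATCH should be an absolute path, because the function changes into DIR first.
# The function name is illustrative only.
apply_patch_checked() {
    (
        cd "$1" || exit 1
        # --dry-run reports what would happen without modifying any file.
        patch -p1 --dry-run < "$2" > /dev/null || {
            echo "patch does not apply cleanly; recheck the files copied in step 4" >&2
            exit 1
        }
        patch -p1 < "$2"
    )
}

# Usage:
# apply_patch_checked /opt/Spark-ml-algo-lib /opt/Spark-ml-algo-lib/Spark-ml-algo-lib.patch
```

Running the body in a subshell keeps the caller's working directory unchanged whether or not the patch applies.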
    

    After the patch is applied, the complete Spark-ml-algo-lib adaptation code has the following directory structure and files:

    Spark-ml-algo-lib
    ├── LICENSE
    ├── ml-accelerator
    │   ├── pom.xml
    │   └── src
    │       └── main
    │           └── scala
    │               ├── breeze
    │               │   └── optimize
    │               │       ├── FirstOrderMinimizerX.scala
    │               │       ├── LBFGSX.scala
    │               │       └── OWLQNX.scala
    │               └── org
    │                   └── apache
    │                       └── spark
    │                           ├── ml
    │                           │   ├── classification
    │                           │   │   ├── DecisionTreeClassifier.scala
    │                           │   │   ├── GBTClassifier.scala
    │                           │   │   ├── LinearSVC.scala
    │                           │   │   ├── LogisticRegression.scala
    │                           │   │   └── RandomForestClassifier.scala
    │                           │   ├── optim
    │                           │   │   ├── aggregator
    │                           │   │   │   ├── DifferentiableLossAggregatorX.scala
    │                           │   │   │   ├── HingeAggregatorX.scala
    │                           │   │   │   ├── HuberAggregatorX.scala
    │                           │   │   │   ├── LeastSquaresAggregatorX.scala
    │                           │   │   │   └── LogisticAggregatorX.scala
    │                           │   │   └── loss
    │                           │   │       └── RDDLossFunctionX.scala
    │                           │   ├── regression
    │                           │   │   ├── DecisionTreeRegressor.scala
    │                           │   │   ├── GBTRegressor.scala
    │                           │   │   ├── LinearRegression.scala
    │                           │   │   └── RandomForestRegressor.scala
    │                           │   └── tree
    │                           │       ├── impl
    │                           │       │   ├── DecisionForest.scala
    │                           │       │   ├── GradientBoostedTrees.scala
    │                           │       │   ├── NodeIdCache.scala
    │                           │       │   ├── RandomForest4GBDTX.scala
    │                           │       │   ├── RandomForestRaw.scala
    │                           │       │   └── RandomForest.scala
    │                           │       └── treeParams.scala
    │                           └── mllib
    │                               ├── clustering
    │                               │   ├── KMACCm.scala
    │                               │   └── KMeans.scala
    │                               ├── linalg
    │                               │   ├── distributed
    │                               │   │   └── RowMatrix.scala
    │                               │   └── EigenValueDecomposition.scala
    │                               └── tree
    │                                   └── DecisionTree.scala
    ├── ml-core
    │   ├── pom.xml
    │   └── src
    │       └── main
    │           └── scala
    │               └── org
    │                   └── apache
    │                       └── spark
    │                           ├── ml
    │                           │   └── tree
    │                           │       ├── impl
    │                           │       │   ├── BaggedPoint.scala
    │                           │       │   ├── DTFeatureStatsAggregator.scala
    │                           │       │   ├── DTStatsAggregator.scala
    │                           │       │   ├── GradientBoostedTreesCore.scala
    │                           │       │   ├── TreePointX.scala
    │                           │       │   └── TreePointY.scala
    │                           │       ├── Node.scala
    │                           │       └── Split.scala
    │                           └── mllib
    │                               └── tree
    │                                   └── impurity
    │                                       ├── Entropy.scala
    │                                       ├── Gini.scala
    │                                       ├── Impurities.scala
    │                                       ├── Impurity.scala
    │                                       └── Variance.scala
    ├── ml-kernel-client
    │   ├── pom.xml
    │   └── src
    │       └── main
    │           └── scala
    │               ├── breeze
    │               │   ├── linalg
    │               │   │   ├── blas
    │               │   │   │   ├── Dgemv.scala
    │               │   │   │   └── Gramian.scala
    │               │   │   ├── DenseMatrixUtil.scala
    │               │   │   ├── DenseVectorUtil.scala
    │               │   │   └── lapack
    │               │   │       └── EigenDecomposition.scala
    │               │   └── optimize
    │               │       ├── ACC.scala
    │               │       ├── LBFGSL.scala
    │               │       └── OWLQNL.scala
    │               └── org
    │                   └── apache
    │                       └── spark
    │                           ├── ml
    │                           │   └── tree
    │                           │       └── impl
    │                           │           ├── DTUtils.scala
    │                           │           ├── GradientBoostedTreesUtil.scala
    │                           │           └── RFUtils.scala
    │                           ├── mllib.clustering
    │                           │   └── KmeansUtil.scala
    │                           └── mllib.linalg.distributed
    │                               └── RowMatrixUtil.scala
    ├── pom.xml
    ├── README.md
    └── scalastyle-config.xml