Building Adaptation Code for the Spark Machine Learning Algorithm Library
Building the adaptation code Spark-ml-algo-lib
- Download the Spark 2.3.2 source code ZIP file to the /opt/ directory and decompress it. The Spark source code directory /opt/spark-2.3.2 is generated.
Download link: https://github.com/apache/spark/archive/v2.3.2.zip
wget https://github.com/apache/spark/archive/v2.3.2.zip
unzip v2.3.2.zip
- Download the Breeze 0.13.1 source code ZIP file to the /opt/ directory and decompress it. The Breeze source code directory /opt/breeze-releases-v0.13.1 is generated.
Download link: https://github.com/scalanlp/breeze/archive/releases/v0.13.1.zip
wget https://github.com/scalanlp/breeze/archive/releases/v0.13.1.zip
unzip v0.13.1.zip
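Before continuing, it can help to confirm that both archives unpacked to the directory names the later steps rely on. The following is a minimal sketch, not part of the original procedure: the function name check_sources is hypothetical, and the base directory is passed as a parameter so the check can also be tried outside /opt/.

```shell
#!/bin/sh
# Hypothetical helper: confirm that the unpacked Spark and Breeze source
# trees exist under a base directory.
# Usage: check_sources /opt
check_sources() {
    base="$1"
    status=0
    for dir in spark-2.3.2 breeze-releases-v0.13.1; do
        if [ -d "$base/$dir" ]; then
            echo "found: $base/$dir"
        else
            echo "missing: $base/$dir" >&2
            status=1
        fi
    done
    # Non-zero when at least one expected directory is absent.
    return "$status"
}
```

Calling check_sources /opt returns a non-zero status if either directory is missing, so it can gate the remaining steps in a script.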
- In the /opt/ directory, create a project named Spark-ml-algo-lib with the following directory structure.

cd /opt/
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/breeze/optimize
mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/breeze/numerics
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/classification
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/optim/aggregator
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/optim/loss
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/recommendation
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/regression
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/impl
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/clustering
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/fpm
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/linalg/distributed
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/tree
mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/impl
mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/clustering
mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/fpm
mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/tree/impurity
- Copy the original files in Spark 2.3.2 and Breeze 0.13.1 to the Spark-ml-algo-lib directories according to the mapping in Table 1 and Table 2. The following provides two sample commands for copying files to the destination directories.
Some files need to be renamed after being copied to the destination folders.
Sample commands:
cp /opt/spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala /opt/Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala
cp /opt/breeze-releases-v0.13.1/math/src/main/scala/breeze/optimize/FirstOrderMinimizer.scala /opt/Spark-ml-algo-lib/ml-accelerator/src/main/scala/breeze/optimize/FirstOrderMinimizerX.scala
Table 1 Spark files required in the Spark-ml-algo-lib project

| Directory in the Spark-ml-algo-lib Project | File Name in the Spark-ml-algo-lib Project | Original Directory in Spark | Original File Name in Spark |
|---|---|---|---|
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/classification/ | GBTClassifier.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/classification/ | GBTClassifier.scala |
| | LinearSVC.scala | | LinearSVC.scala |
| | RandomForestClassifier.scala | | RandomForestClassifier.scala |
| | DecisionTreeClassifier.scala | | DecisionTreeClassifier.scala |
| | LogisticRegression.scala | | LogisticRegression.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/optim/aggregator/ | DifferentiableLossAggregatorX.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/ | DifferentiableLossAggregator.scala |
| | HingeAggregatorX.scala | | HingeAggregator.scala |
| | HuberAggregatorX.scala | | HuberAggregator.scala |
| | LeastSquaresAggregatorX.scala | | LeastSquaresAggregator.scala |
| | LogisticAggregatorX.scala | | LogisticAggregator.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/optim/loss/ | RDDLossFunctionX.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/optim/loss/ | RDDLossFunction.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/regression/ | DecisionTreeRegressor.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/regression/ | DecisionTreeRegressor.scala |
| | GBTRegressor.scala | | GBTRegressor.scala |
| | LinearRegression.scala | | LinearRegression.scala |
| | RandomForestRegressor.scala | | RandomForestRegressor.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/impl/ | GradientBoostedTrees.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | GradientBoostedTrees.scala |
| | NodeIdCache.scala | | NodeIdCache.scala |
| | RandomForest.scala | | RandomForest.scala |
| | RandomForest4GBDTX.scala | | RandomForest.scala |
| | RandomForestRaw.scala | | RandomForest.scala |
| | DecisionForest.scala | | RandomForest.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/ | treeParams.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/ | treeParams.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/clustering/ | KMACCm.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/clustering/ | KMeans.scala |
| | KMeans.scala | | KMeans.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/linalg/distributed/ | RowMatrix.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/ | RowMatrix.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/linalg/ | EigenValueDecomposition.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/linalg/ | EigenValueDecomposition.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/tree/ | DecisionTree.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/tree/ | DecisionTree.scala |
| Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/ | Node.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/ | Node.scala |
| | Split.scala | | Split.scala |
| Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/impl/ | BaggedPoint.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | BaggedPoint.scala |
| | DTFeatureStatsAggregator.scala | | DTStatsAggregator.scala |
| | DTStatsAggregator.scala | | DTStatsAggregator.scala |
| | GradientBoostedTreesCore.scala | | RandomForest.scala |
| | TreePointX.scala | | TreePoint.scala |
| | TreePointY.scala | | TreePoint.scala |
| Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/tree/impurity/ | Entropy.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/ | Entropy.scala |
| | Gini.scala | | Gini.scala |
| | Impurities.scala | | Impurities.scala |
| | Impurity.scala | | Impurity.scala |
| | Variance.scala | | Variance.scala |
Table 2 Breeze files required in the Spark-ml-algo-lib project

| Directory in the Spark-ml-algo-lib Project | File Name in the Spark-ml-algo-lib Project | Original Directory in Breeze | Original File Name in Breeze |
|---|---|---|---|
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/breeze/optimize/ | FirstOrderMinimizerX.scala | breeze-releases-v0.13.1/math/src/main/scala/breeze/optimize/ | FirstOrderMinimizer.scala |
| | LBFGSX.scala | | LBFGS.scala |
| | OWLQNX.scala | | OWLQN.scala |
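Because many rows in Table 1 and Table 2 rename the file during the copy, a small wrapper around cp can reduce typing mistakes. The following is a sketch under stated assumptions: the function name copy_as and its argument order are hypothetical, and the commented usage lines cover only the three Breeze rows of Table 2.

```shell
#!/bin/sh
# Hypothetical helper: copy one source file into a destination directory
# under a (possibly different) file name.
# Usage: copy_as <source file> <destination directory> <destination file name>
copy_as() {
    src="$1"
    dest_dir="$2"
    dest_name="$3"
    if [ ! -f "$src" ]; then
        echo "source not found: $src" >&2
        return 1
    fi
    # Create the destination directory if needed, then copy with the new name.
    mkdir -p "$dest_dir" && cp "$src" "$dest_dir/$dest_name"
}

# Usage for the Breeze rows of Table 2 (paths follow the steps above):
# SRC=/opt/breeze-releases-v0.13.1/math/src/main/scala/breeze/optimize
# DST=/opt/Spark-ml-algo-lib/ml-accelerator/src/main/scala/breeze/optimize
# copy_as "$SRC/FirstOrderMinimizer.scala" "$DST" FirstOrderMinimizerX.scala
# copy_as "$SRC/LBFGS.scala"               "$DST" LBFGSX.scala
# copy_as "$SRC/OWLQN.scala"               "$DST" OWLQNX.scala
```

The helper stops with an error message when a source file is missing, which makes a mistyped path visible immediately instead of leaving a gap in the project tree.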
After you copy the files as described in the previous step, the directory structure of the Spark-ml-algo-lib project and the files in it are as follows:
Spark-ml-algo-lib
├── ml-accelerator
│   └── src
│       └── main
│           └── scala
│               ├── breeze
│               │   └── optimize
│               │       ├── FirstOrderMinimizerX.scala
│               │       ├── LBFGSX.scala
│               │       └── OWLQNX.scala
│               └── org
│                   └── apache
│                       └── spark
│                           ├── ml
│                           │   ├── classification
│                           │   │   ├── DecisionTreeClassifier.scala
│                           │   │   ├── GBTClassifier.scala
│                           │   │   ├── LinearSVC.scala
│                           │   │   ├── LogisticRegression.scala
│                           │   │   └── RandomForestClassifier.scala
│                           │   ├── optim
│                           │   │   ├── aggregator
│                           │   │   │   ├── DifferentiableLossAggregatorX.scala
│                           │   │   │   ├── HingeAggregatorX.scala
│                           │   │   │   ├── HuberAggregatorX.scala
│                           │   │   │   ├── LeastSquaresAggregatorX.scala
│                           │   │   │   └── LogisticAggregatorX.scala
│                           │   │   └── loss
│                           │   │       └── RDDLossFunctionX.scala
│                           │   ├── regression
│                           │   │   ├── DecisionTreeRegressor.scala
│                           │   │   ├── GBTRegressor.scala
│                           │   │   ├── LinearRegression.scala
│                           │   │   └── RandomForestRegressor.scala
│                           │   └── tree
│                           │       ├── impl
│                           │       │   ├── DecisionForest.scala
│                           │       │   ├── GradientBoostedTrees.scala
│                           │       │   ├── NodeIdCache.scala
│                           │       │   ├── RandomForest4GBDTX.scala
│                           │       │   ├── RandomForestRaw.scala
│                           │       │   └── RandomForest.scala
│                           │       └── treeParams.scala
│                           └── mllib
│                               ├── clustering
│                               │   ├── KMACCm.scala
│                               │   └── KMeans.scala
│                               ├── linalg
│                               │   ├── distributed
│                               │   │   └── RowMatrix.scala
│                               │   └── EigenValueDecomposition.scala
│                               └── tree
│                                   └── DecisionTree.scala
└── ml-core
    └── src
        └── main
            └── scala
                └── org
                    └── apache
                        └── spark
                            ├── ml
                            │   └── tree
                            │       ├── impl
                            │       │   ├── BaggedPoint.scala
                            │       │   ├── DTFeatureStatsAggregator.scala
                            │       │   ├── DTStatsAggregator.scala
                            │       │   ├── GradientBoostedTreesCore.scala
                            │       │   ├── TreePointX.scala
                            │       │   └── TreePointY.scala
                            │       ├── Node.scala
                            │       └── Split.scala
                            └── mllib
                                └── tree
                                    └── impurity
                                        ├── Entropy.scala
                                        ├── Gini.scala
                                        ├── Impurities.scala
                                        ├── Impurity.scala
                                        └── Variance.scala
- Download Spark-ml-algo-lib.patch to the /opt/Spark-ml-algo-lib/ directory and apply it to the project to obtain the complete adaptation code Spark-ml-algo-lib of the machine learning algorithm library.
cd /opt/Spark-ml-algo-lib
wget https://github.com/kunpengcompute/Spark-ml-algo-lib/releases/download/v1.1.0/Spark-ml-algo-lib.patch
patch -p1 < Spark-ml-algo-lib.patch
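If files were copied to the wrong location or under the wrong name, the patch may apply only partially. One cautious approach is to run a dry run first and apply the patch only when it succeeds. The following is a sketch, assuming GNU patch (which provides the --dry-run option); the function name apply_patch_checked is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: apply a patch only if a dry run succeeds.
# Usage: apply_patch_checked <project directory> <absolute path to patch file>
apply_patch_checked() {
    project_dir="$1"
    patch_file="$2"
    (
        # Run in a subshell so the caller's working directory is unchanged.
        cd "$project_dir" || exit 1
        # --dry-run reports what would happen without modifying any file.
        if ! patch -p1 --dry-run < "$patch_file" >/dev/null; then
            echo "patch does not apply cleanly: $patch_file" >&2
            exit 1
        fi
        patch -p1 < "$patch_file"
    )
}

# Usage (paths follow the steps above):
# apply_patch_checked /opt/Spark-ml-algo-lib /opt/Spark-ml-algo-lib/Spark-ml-algo-lib.patch
```

The -p1 level matches the command in the original step; the patch file path should be absolute because the function changes into the project directory before reading it.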
The directory structure of the complete adaptation code Spark-ml-algo-lib of the machine learning algorithm library and the files in the directory are as follows:
Spark-ml-algo-lib
├── LICENSE
├── ml-accelerator
│   ├── pom.xml
│   └── src
│       └── main
│           └── scala
│               ├── breeze
│               │   └── optimize
│               │       ├── FirstOrderMinimizerX.scala
│               │       ├── LBFGSX.scala
│               │       └── OWLQNX.scala
│               └── org
│                   └── apache
│                       └── spark
│                           ├── ml
│                           │   ├── classification
│                           │   │   ├── DecisionTreeClassifier.scala
│                           │   │   ├── GBTClassifier.scala
│                           │   │   ├── LinearSVC.scala
│                           │   │   ├── LogisticRegression.scala
│                           │   │   └── RandomForestClassifier.scala
│                           │   ├── optim
│                           │   │   ├── aggregator
│                           │   │   │   ├── DifferentiableLossAggregatorX.scala
│                           │   │   │   ├── HingeAggregatorX.scala
│                           │   │   │   ├── HuberAggregatorX.scala
│                           │   │   │   ├── LeastSquaresAggregatorX.scala
│                           │   │   │   └── LogisticAggregatorX.scala
│                           │   │   └── loss
│                           │   │       └── RDDLossFunctionX.scala
│                           │   ├── regression
│                           │   │   ├── DecisionTreeRegressor.scala
│                           │   │   ├── GBTRegressor.scala
│                           │   │   ├── LinearRegression.scala
│                           │   │   └── RandomForestRegressor.scala
│                           │   └── tree
│                           │       ├── impl
│                           │       │   ├── DecisionForest.scala
│                           │       │   ├── GradientBoostedTrees.scala
│                           │       │   ├── NodeIdCache.scala
│                           │       │   ├── RandomForest4GBDTX.scala
│                           │       │   ├── RandomForestRaw.scala
│                           │       │   └── RandomForest.scala
│                           │       └── treeParams.scala
│                           └── mllib
│                               ├── clustering
│                               │   ├── KMACCm.scala
│                               │   └── KMeans.scala
│                               ├── linalg
│                               │   ├── distributed
│                               │   │   └── RowMatrix.scala
│                               │   └── EigenValueDecomposition.scala
│                               └── tree
│                                   └── DecisionTree.scala
├── ml-core
│   ├── pom.xml
│   └── src
│       └── main
│           └── scala
│               └── org
│                   └── apache
│                       └── spark
│                           ├── ml
│                           │   └── tree
│                           │       ├── impl
│                           │       │   ├── BaggedPoint.scala
│                           │       │   ├── DTFeatureStatsAggregator.scala
│                           │       │   ├── DTStatsAggregator.scala
│                           │       │   ├── GradientBoostedTreesCore.scala
│                           │       │   ├── TreePointX.scala
│                           │       │   └── TreePointY.scala
│                           │       ├── Node.scala
│                           │       └── Split.scala
│                           └── mllib
│                               └── tree
│                                   └── impurity
│                                       ├── Entropy.scala
│                                       ├── Gini.scala
│                                       ├── Impurities.scala
│                                       ├── Impurity.scala
│                                       └── Variance.scala
├── ml-kernel-client
│   ├── pom.xml
│   └── src
│       └── main
│           └── scala
│               ├── breeze
│               │   ├── linalg
│               │   │   ├── blas
│               │   │   │   ├── Dgemv.scala
│               │   │   │   └── Gramian.scala
│               │   │   ├── DenseMatrixUtil.scala
│               │   │   ├── DenseVectorUtil.scala
│               │   │   └── lapack
│               │   │       └── EigenDecomposition.scala
│               │   └── optimize
│               │       ├── ACC.scala
│               │       ├── LBFGSL.scala
│               │       └── OWLQNL.scala
│               └── org
│                   └── apache
│                       └── spark
│                           ├── ml
│                           │   └── tree
│                           │       └── impl
│                           │           ├── DTUtils.scala
│                           │           ├── GradientBoostedTreesUtil.scala
│                           │           └── RFUtils.scala
│                           ├── mllib.clustering
│                           │   └── KmeansUtil.scala
│                           └── mllib.linalg.distributed
│                               └── RowMatrixUtil.scala
├── pom.xml
├── README.md
└── scalastyle-config.xml
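A quick way to confirm that the patch added the build files shown in the tree above is to check for the four pom.xml files. The following is a minimal sketch; the function name verify_layout is hypothetical, and it checks only the pom.xml files, not every source file in the tree.

```shell
#!/bin/sh
# Hypothetical helper: check that the patched project contains the root and
# per-module pom.xml files listed in the directory tree above.
# Usage: verify_layout /opt/Spark-ml-algo-lib
verify_layout() {
    root="$1"
    missing=0
    for f in pom.xml ml-accelerator/pom.xml ml-core/pom.xml ml-kernel-client/pom.xml; do
        if [ ! -f "$root/$f" ]; then
            echo "missing: $root/$f" >&2
            missing=$((missing + 1))
        fi
    done
    echo "missing files: $missing"
    # Succeed only when every expected file is present.
    [ "$missing" -eq 0 ]
}
```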