Building Adaptation Code for the Machine Learning Algorithm Library
- This section describes how to build Spark-ml-algo-lib, the adaptation code for the machine learning algorithm library. It uses the adaptation to Spark 2.3.2 as an example; the process for Spark 2.4.6 and Spark 3.1.1 is similar.
- Perform the following operations in the Linux environment. This section is for reference only.
- Download the Spark 2.3.2 source code ZIP file to the /opt/ directory and decompress it. The Spark source code directory is generated.
Download URL: https://github.com/apache/spark/archive/v2.3.2.zip
    wget https://github.com/apache/spark/archive/v2.3.2.zip
- Download the Breeze 0.13.1 source code ZIP file to the /opt/ directory and decompress it. The Breeze source code directory is generated.
Download URL: https://github.com/scalanlp/breeze/archive/releases/v0.13.1.zip
    wget https://github.com/scalanlp/breeze/archive/releases/v0.13.1.zip
- Download the XGBoost 1.1.0 source code ZIP file to the /opt/ directory and decompress it. The XGBoost source code directory is generated.
- Obtain the CUB source package and decompress it in the /opt/xgboost-1.1.0 directory to obtain the CUB source directory /opt/xgboost-1.1.0/cub-b20808b1b04ec3d6a625e51fbc1eb76f337754ad. Then delete the /opt/xgboost-1.1.0/cub directory and rename /opt/xgboost-1.1.0/cub-b20808b1b04ec3d6a625e51fbc1eb76f337754ad to /opt/xgboost-1.1.0/cub.
    cd /opt/xgboost-1.1.0
    # -O saves the archive under the file name that the unzip command expects.
    wget -O cub-b20808b1b04ec3d6a625e51fbc1eb76f337754ad.zip https://github.com/NVlabs/cub/archive/b20808b1b04ec3d6a625e51fbc1eb76f337754ad.zip
    unzip cub-b20808b1b04ec3d6a625e51fbc1eb76f337754ad.zip
    rm -rf cub
    mv cub-b20808b1b04ec3d6a625e51fbc1eb76f337754ad cub
- Obtain the dmlc-core source package and decompress it in the /opt/xgboost-1.1.0 directory to obtain the dmlc-core source directory /opt/xgboost-1.1.0/dmlc-core-5df8305fe699d3b503d10c60a231ab0223142407. Then delete the /opt/xgboost-1.1.0/dmlc-core directory and rename /opt/xgboost-1.1.0/dmlc-core-5df8305fe699d3b503d10c60a231ab0223142407 to /opt/xgboost-1.1.0/dmlc-core.
    cd /opt/xgboost-1.1.0
    # -O saves the archive under the file name that the unzip command expects.
    wget -O dmlc-core-5df8305fe699d3b503d10c60a231ab0223142407.zip https://github.com/dmlc/dmlc-core/archive/5df8305fe699d3b503d10c60a231ab0223142407.zip
    unzip dmlc-core-5df8305fe699d3b503d10c60a231ab0223142407.zip
    rm -rf dmlc-core
    mv dmlc-core-5df8305fe699d3b503d10c60a231ab0223142407 dmlc-core
- Obtain the Rabit source package and decompress it in the /opt/xgboost-1.1.0 directory to obtain the Rabit source directory /opt/xgboost-1.1.0/rabit-4fb34a008db6437c84d1877635064e09a55c8553. Then delete the /opt/xgboost-1.1.0/rabit directory and rename /opt/xgboost-1.1.0/rabit-4fb34a008db6437c84d1877635064e09a55c8553 to /opt/xgboost-1.1.0/rabit.
    cd /opt/xgboost-1.1.0
    # -O saves the archive under the file name that the unzip command expects.
    wget -O rabit-4fb34a008db6437c84d1877635064e09a55c8553.zip https://github.com/dmlc/rabit/archive/4fb34a008db6437c84d1877635064e09a55c8553.zip
    unzip rabit-4fb34a008db6437c84d1877635064e09a55c8553.zip
    rm -rf rabit
    mv rabit-4fb34a008db6437c84d1877635064e09a55c8553 rabit
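The CUB, dmlc-core, and Rabit steps follow the same pattern, so they could be wrapped in one helper. The following is a minimal sketch rather than part of the official procedure: the function name replace_pinned and its argument order are hypothetical, and it assumes that wget and unzip are installed and that each archive unpacks to a directory named <name>-<commit>, as described in the steps above.

```shell
# Hypothetical helper: replace_pinned <xgboost_dir> <github_repo> <name> <commit>
# Downloads the pinned commit of a component, unpacks it inside the XGBoost
# source tree, and swaps it in for the existing <name> directory.
# The body runs in a subshell, so the caller's working directory is unchanged.
replace_pinned() (
    xgb_dir=$1 repo=$2 name=$3 commit=$4
    cd "$xgb_dir" &&
    # -O fixes the saved file name so the unzip call below can find it.
    wget -O "$name-$commit.zip" "https://github.com/$repo/archive/$commit.zip" &&
    unzip -q "$name-$commit.zip" &&
    rm -rf "$name" &&
    mv "$name-$commit" "$name"
)
```

For example, calling replace_pinned /opt/xgboost-1.1.0 NVlabs/cub cub b20808b1b04ec3d6a625e51fbc1eb76f337754ad would repeat the CUB step; the dmlc-core and Rabit steps differ only in the repository, name, and commit arguments.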
- Download the Netlib source package to the /opt/ directory and decompress it. The netlib-2.2.1 source code directory is generated.
Download URL: https://github.com/luhenry/netlib/archive/refs/tags/v2.2.1.zip
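Before the project is created, it may help to confirm that all four source directories are in place. The following is a minimal sketch; check_sources is a hypothetical helper, and the directory names are the ones used in the paths of this section.

```shell
# Hypothetical helper: check_sources [base_dir]
# Reports which of the expected source directories exist under base_dir
# (default: /opt) and returns non-zero if any of them is missing.
check_sources() {
    base=${1:-/opt}
    missing=0
    for dir in spark-2.3.2 breeze-releases-v0.13.1 xgboost-1.1.0 netlib-2.2.1; do
        if [ -d "$base/$dir" ]; then
            echo "found: $base/$dir"
        else
            echo "missing: $base/$dir"
            missing=1
        fi
    done
    return "$missing"
}
```

Running check_sources with no argument checks /opt; a non-zero exit status means that at least one download or decompression step still needs to be done.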
- In the /opt/ directory, create a project named Spark-ml-algo-lib with the following directory structure.

    cd /opt/
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/breeze/optimize
    mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/breeze/numerics
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/classification
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/feature
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/optim/aggregator
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/optim/loss
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/recommendation
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/regression
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/stat
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/impl
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/clustering
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/feature
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/fpm
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/linalg/distributed
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/stat/correlation
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/tree
    mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/impl
    mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/clustering
    mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/fpm
    mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/tree/impurity
    mkdir -p Spark-ml-algo-lib/ml-core/src/main/java/dev/ludovic/netlib/blas
    cp -r xgboost-1.1.0 Spark-ml-algo-lib/ml-xgboost
- Copy the original files in Spark 2.3.2, Breeze 0.13.1, and Netlib 2.2.1 to the Spark-ml-algo-lib directories according to the mappings in Table 1, Table 2, and Table 3. Delete the unnecessary XGBoost native code listed under "Files or directories to be deleted from the XGBoost native code", and copy the remaining code to the Spark-ml-algo-lib/ml-xgboost directory. Rename the directories listed in Table 4: the first column gives the current directory name and the second column gives the new name. The following provides two sample commands for copying files to the destination directories.
Some files are given a new name in the destination folder. In the second sample command, FirstOrderMinimizer.scala is copied as FirstOrderMinimizerX.scala.
Sample commands:
    cp /opt/spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala /opt/Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala
    cp /opt/breeze-releases-v0.13.1/math/src/main/scala/breeze/optimize/FirstOrderMinimizer.scala /opt/Spark-ml-algo-lib/ml-accelerator/src/main/scala/breeze/optimize/FirstOrderMinimizerX.scala
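Because Table 1, Table 2, and Table 3 list many files, typing one cp command per row is error-prone. The following is a minimal sketch of a mapping-driven copy; copy_mapped is a hypothetical helper, and the one-pair-per-line input format is an assumption made for illustration. It does not handle paths that contain spaces.

```shell
# Hypothetical helper: copy_mapped
# Reads "<source> <destination>" pairs from standard input, one pair per line,
# and copies each source file to its destination path (which may use a new
# file name). Stops at the first missing source so that a typo in the mapping
# is caught early.
copy_mapped() {
    while read -r src dst; do
        # Skip blank lines and comment lines in the mapping.
        case $src in ''|'#'*) continue ;; esac
        if [ ! -f "$src" ]; then
            echo "missing source: $src" >&2
            return 1
        fi
        # Create the destination directory if needed, then copy.
        mkdir -p "$(dirname "$dst")" && cp "$src" "$dst" || return 1
    done
}
```

Feeding the source and destination paths of the two sample commands to copy_mapped, for example through a here-document or a mapping file, would have the same effect as running the two cp commands.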
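Table 1 Spark files required in the Spark-ml-algo-lib project

| Directory in the Spark-ml-algo-lib Project | File Name in the Spark-ml-algo-lib Project | Original Directory in Spark | Original File Name in Spark |
| --- | --- | --- | --- |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/classification/ | GBTClassifier.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/classification/ | GBTClassifier.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/classification/ | LinearSVC.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/classification/ | LinearSVC.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/classification/ | RandomForestClassifier.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/classification/ | RandomForestClassifier.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/classification/ | DecisionTreeClassifier.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/classification/ | DecisionTreeClassifier.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/classification/ | LogisticRegression.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/classification/ | LogisticRegression.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/feature/ | IDF.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/feature/ | IDF.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/feature/ | Word2Vec.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/feature/ | Word2Vec.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/feature/ | DecisionTreeBucketizer.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/classification/ | RandomForestClassifier.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/optim/aggregator/ | DifferentiableLossAggregatorX.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/ | DifferentiableLossAggregator.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/optim/aggregator/ | HingeAggregatorX.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/ | HingeAggregator.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/optim/aggregator/ | HuberAggregatorX.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/ | HuberAggregator.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/optim/aggregator/ | LeastSquaresAggregatorX.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/ | LeastSquaresAggregator.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/optim/aggregator/ | LogisticAggregatorX.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/ | LogisticAggregator.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/optim/loss/ | RDDLossFunctionX.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/optim/loss/ | RDDLossFunction.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/recommendation/ | ALS.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/recommendation/ | ALS.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/regression/ | DecisionTreeRegressor.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/regression/ | DecisionTreeRegressor.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/regression/ | GBTRegressor.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/regression/ | GBTRegressor.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/regression/ | LinearRegression.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/regression/ | LinearRegression.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/regression/ | RandomForestRegressor.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/regression/ | RandomForestRegressor.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/stat/ | Correlation.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/stat/ | Correlation.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/impl/ | GradientBoostedTrees.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | GradientBoostedTrees.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/impl/ | NodeIdCache.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | NodeIdCache.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/impl/ | RandomForest.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | RandomForest.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/impl/ | RandomForest4GBDTX.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | RandomForest.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/impl/ | RandomForestRaw.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | RandomForest.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/impl/ | DecisionForest.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | RandomForest.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/impl/ | DecisionTreeBucket.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | RandomForest.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/impl/ | DecisionTreeMetadata.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | DecisionTreeMetadata.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/ | treeParams.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/ | treeParams.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/ | treeModels.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/ | treeModels.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/clustering/ | KMACCm.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/clustering/ | KMeans.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/clustering/ | KMeans.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/clustering/ | KMeans.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/clustering/ | LDA.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/clustering/ | LDA.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/clustering/ | LDAOptimizer.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/clustering/ | LDAOptimizer.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/feature/ | IDF.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/feature/ | IDF.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/feature/ | Word2Vec.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/feature/ | Word2Vec.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/fpm/ | PrefixSpan.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/fpm/ | PrefixSpan.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/linalg/distributed/ | RowMatrix.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/ | RowMatrix.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/linalg/ | EigenValueDecomposition.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/linalg/ | EigenValueDecomposition.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/stat/correlation/ | Correlation.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/ | Correlation.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/stat/correlation/ | PearsonCorrelation.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/ | PearsonCorrelation.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/stat/correlation/ | SpearmanCorrelation.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/ | SpearmanCorrelation.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/tree/ | DecisionTree.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/tree/ | DecisionTree.scala |
| Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/ | Node.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/ | Node.scala |
| Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/ | Split.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/ | Split.scala |
| Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/impl/ | BaggedPoint.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | BaggedPoint.scala |
| Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/impl/ | DTFeatureStatsAggregator.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | DTStatsAggregator.scala |
| Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/impl/ | DTStatsAggregator.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | DTStatsAggregator.scala |
| Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/impl/ | GradientBoostedTreesCore.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | RandomForest.scala |
| Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/impl/ | TreePointX.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | TreePoint.scala |
| Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/impl/ | TreePointY.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | TreePoint.scala |
| Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/clustering/ | LDAUtilsX.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/clustering/ | LDAUtils.scala |
| Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/clustering/ | OnlineLDAOptimizerXObj.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/clustering/ | LDAOptimizer.scala |
| Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/fpm/ | LocalPrefixSpan.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/fpm/ | LocalPrefixSpan.scala |
| Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/fpm/ | PrefixSpanBase.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/fpm/ | PrefixSpan.scala |
| Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/tree/impurity/ | Entropy.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/ | Entropy.scala |
| Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/tree/impurity/ | Gini.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/ | Gini.scala |
| Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/tree/impurity/ | Impurities.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/ | Impurities.scala |
| Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/tree/impurity/ | Impurity.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/ | Impurity.scala |
| Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/tree/impurity/ | Variance.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/ | Variance.scala |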
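
Table 2 Breeze files required in the Spark-ml-algo-lib project

| Directory in the Spark-ml-algo-lib Project | File Name in the Spark-ml-algo-lib Project | Original Directory in Breeze | Original File Name in Breeze |
| --- | --- | --- | --- |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/breeze/optimize/ | FirstOrderMinimizerX.scala | breeze-releases-v0.13.1/math/src/main/scala/breeze/optimize/ | FirstOrderMinimizer.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/breeze/optimize/ | LBFGSX.scala | breeze-releases-v0.13.1/math/src/main/scala/breeze/optimize/ | LBFGS.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/breeze/optimize/ | OWLQNX.scala | breeze-releases-v0.13.1/math/src/main/scala/breeze/optimize/ | OWLQN.scala |
| Spark-ml-algo-lib/ml-core/src/main/scala/breeze/numerics/ | DigammaX.scala | breeze-releases-v0.13.1/math/src/main/scala/breeze/numerics/ | package.scala |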
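
Table 3 Netlib files required in the Spark-ml-algo-lib project

| Directory in the Spark-ml-algo-lib Project | File Name in the Spark-ml-algo-lib Project | Original Directory in Netlib | Original File Name in Netlib |
| --- | --- | --- | --- |
| Spark-ml-algo-lib/ml-core/src/main/java/dev/ludovic/netlib/ | BLAS.java | netlib-2.2.1/blas/src/main/java/dev/ludovic/netlib/ | BLAS.java |
| Spark-ml-algo-lib/ml-core/src/main/java/dev/ludovic/netlib/ | InstanceBuilder.java | netlib-2.2.1/blas/src/main/java/dev/ludovic/netlib/ | InstanceBuilder.java |
| Spark-ml-algo-lib/ml-core/src/main/java/dev/ludovic/netlib/ | JavaBLAS.java | netlib-2.2.1/blas/src/main/java/dev/ludovic/netlib/ | JavaBLAS.java |
| Spark-ml-algo-lib/ml-core/src/main/java/dev/ludovic/netlib/ | NativeBLAS.java | netlib-2.2.1/blas/src/main/java/dev/ludovic/netlib/ | NativeBLAS.java |
| Spark-ml-algo-lib/ml-core/src/main/java/dev/ludovic/netlib/blas/ | AbstractBLAS.java | netlib-2.2.1/blas/src/main/java/dev/ludovic/netlib/blas/ | AbstractBLAS.java |
| Spark-ml-algo-lib/ml-core/src/main/java/dev/ludovic/netlib/blas/ | F2jBLAS.java | netlib-2.2.1/blas/src/main/java/dev/ludovic/netlib/blas/ | F2jBLAS.java |
| Spark-ml-algo-lib/ml-core/src/main/java/dev/ludovic/netlib/blas/ | JNIBLAS.java | netlib-2.2.1/blas/src/main/java/dev/ludovic/netlib/blas/ | JNIBLAS.java |
| Spark-ml-algo-lib/ml-core/src/main/java/dev/ludovic/netlib/blas/ | Java8BLAS.java | netlib-2.2.1/blas/src/main/java/dev/ludovic/netlib/blas/ | Java8BLAS.java |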

Table 4 Directories whose names need to be changed

| Directory in the Spark-ml-algo-lib Project | New Name of the Directory |
| --- | --- |
| Spark-ml-algo-lib/ml-xgboost/jvm-packages/xgboost4j | Spark-ml-algo-lib/ml-xgboost/jvm-packages/boostkit-xgboost4j |
| Spark-ml-algo-lib/ml-xgboost/jvm-packages/xgboost4j-example | Spark-ml-algo-lib/ml-xgboost/jvm-packages/boostkit-xgboost4j-example |
| Spark-ml-algo-lib/ml-xgboost/jvm-packages/xgboost4j-flink | Spark-ml-algo-lib/ml-xgboost/jvm-packages/boostkit-xgboost4j-flink |
| Spark-ml-algo-lib/ml-xgboost/jvm-packages/xgboost4j-spark | Spark-ml-algo-lib/ml-xgboost/jvm-packages/boostkit-xgboost4j-spark |
| Spark-ml-algo-lib/ml-xgboost/jvm-packages/xgboost4j-tester | Spark-ml-algo-lib/ml-xgboost/jvm-packages/boostkit-xgboost4j-tester |

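The renames in Table 4 all add the boostkit- prefix to a module directory, so they can be expressed as a loop. The following is a minimal sketch; rename_xgboost_modules is a hypothetical helper that takes the jvm-packages directory as its only argument.

```shell
# Hypothetical helper: rename_xgboost_modules <jvm_packages_dir>
# Renames each xgboost4j* module directory listed in Table 4 to its
# boostkit- name. Modules that are absent (for example, already renamed)
# are skipped, so the function can be run more than once.
rename_xgboost_modules() {
    jvm_dir=$1
    for module in xgboost4j xgboost4j-example xgboost4j-flink \
                  xgboost4j-spark xgboost4j-tester; do
        if [ -d "$jvm_dir/$module" ]; then
            mv "$jvm_dir/$module" "$jvm_dir/boostkit-$module" || return 1
        fi
    done
}
```

With the layout used in this section, the argument would be /opt/Spark-ml-algo-lib/ml-xgboost/jvm-packages.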
Files or directories to be deleted from the XGBoost native code:
- xgboost-1.1.0/.github
- xgboost-1.1.0/cub/.settings
- xgboost-1.1.0/cub/.project
- xgboost-1.1.0/dmlc-core/.github
- xgboost-1.1.0/dmlc-core/make/config.mk
- xgboost-1.1.0/dmlc-core/test/unittest/sample.rec
- xgboost-1.1.0/doc/_static
- xgboost-1.1.0/rabit/lib
- xgboost-1.1.0/R-package/data
- xgboost-1.1.0/.gitignore
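The deletion list above can also be applied with a short loop. The following is a minimal sketch; prune_xgboost is a hypothetical helper, and the XGBoost source root is passed as an argument (for example /opt/xgboost-1.1.0, or the copy under Spark-ml-algo-lib/ml-xgboost) so that the same paths can be removed from whichever tree is being prepared.

```shell
# Hypothetical helper: prune_xgboost <xgboost_root>
# Removes the files and directories in the deletion list, given relative to
# the XGBoost source root. Refuses to run if the root does not exist, so an
# empty or mistyped argument cannot delete paths somewhere else.
prune_xgboost() {
    root=$1
    if [ -z "$root" ] || [ ! -d "$root" ]; then
        echo "not a directory: $root" >&2
        return 1
    fi
    for path in \
        .github \
        cub/.settings \
        cub/.project \
        dmlc-core/.github \
        dmlc-core/make/config.mk \
        dmlc-core/test/unittest/sample.rec \
        doc/_static \
        rabit/lib \
        R-package/data \
        .gitignore
    do
        rm -rf "$root/$path"
    done
}
```

Because rm -rf ignores paths that do not exist, running the helper a second time on the same tree has no further effect.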
- Download the patch to the /opt/Spark-ml-algo-lib/ directory and apply it to obtain the complete adaptation code Spark-ml-algo-lib for the machine learning algorithm library. The following commands use the Spark 2.3.2 patch as an example.
    cd /opt/Spark-ml-algo-lib
    wget https://github.com/kunpengcompute/Spark-ml-algo-lib/releases/download/v2.2.0-spark2.3.2/Spark-ml-algo-lib-Spark2.3.2.patch
    patch -p1 < Spark-ml-algo-lib-Spark2.3.2.patch
The directory structure of the complete adaptation code Spark-ml-algo-lib of the machine learning algorithm library is the same as that in the repository.
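If a file from the copy steps is missing or misnamed, the patch may apply only partially. One way to guard against that is to test the patch before applying it. The following is a minimal sketch; apply_patch_checked is a hypothetical helper, and it assumes GNU patch, whose --dry-run option reports problems without modifying any file.

```shell
# Hypothetical helper: apply_patch_checked <source_dir> <patch_file>
# Tests the patch with --dry-run first and applies it only if the test
# succeeds. The body runs in a subshell, so the caller's working directory
# is unchanged. A relative <patch_file> is resolved inside <source_dir>.
apply_patch_checked() (
    src_dir=$1 patch_file=$2
    cd "$src_dir" || exit 1
    if ! patch -p1 --dry-run < "$patch_file" > /dev/null; then
        echo "patch does not apply cleanly; nothing was changed" >&2
        exit 1
    fi
    patch -p1 < "$patch_file"
)
```

For the example in this section, the call would be apply_patch_checked /opt/Spark-ml-algo-lib Spark-ml-algo-lib-Spark2.3.2.patch.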