Building the Adaptation Code for the Machine Learning Algorithm Acceleration Library

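- The adaptation code for the machine learning algorithm acceleration library, Spark-ml-algo-lib, is built as follows. The procedure uses the Spark 2.3.2 adaptation code as the example. Building the adaptation code for Spark 2.4.6 or Spark 3.1.1 is similar and can follow the same steps.
- Perform the following operations in a Linux environment. This section is for reference only.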
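- Download the Spark 2.3.2 source zip package to the "/opt/" directory and decompress it to obtain the Spark source directory.
Download URL: https://github.com/apache/spark/archive/v2.3.2.zip
wget -P /opt/ https://github.com/apache/spark/archive/v2.3.2.zip
unzip /opt/v2.3.2.zip -d /opt/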
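- Download the Breeze 0.13.1 source zip package to the "/opt/" directory and decompress it to obtain the Breeze source directory.
Download URL: https://github.com/scalanlp/breeze/archive/releases/v0.13.1.zip
wget -P /opt/ https://github.com/scalanlp/breeze/archive/releases/v0.13.1.zip
unzip /opt/v0.13.1.zip -d /opt/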
- Obtain the XGBoost 1.1.0 source package, place it in the "/opt/" directory, and decompress it to obtain the XGBoost source directory.
- Download the cub source package to the "/opt/xgboost-1.1.0" directory and decompress it to obtain the cub source directory "/opt/xgboost-1.1.0/cub-b20808b1b04ec3d6a625e51fbc1eb76f337754ad". Then delete the "/opt/xgboost-1.1.0/cub" directory and rename "/opt/xgboost-1.1.0/cub-b20808b1b04ec3d6a625e51fbc1eb76f337754ad" to "/opt/xgboost-1.1.0/cub".
cd /opt/xgboost-1.1.0
wget -O cub-b20808b1b04ec3d6a625e51fbc1eb76f337754ad.zip https://github.com/NVlabs/cub/archive/b20808b1b04ec3d6a625e51fbc1eb76f337754ad.zip
unzip cub-b20808b1b04ec3d6a625e51fbc1eb76f337754ad.zip
rm -rf cub
mv cub-b20808b1b04ec3d6a625e51fbc1eb76f337754ad cub
- Download the dmlc-core source package to the "/opt/xgboost-1.1.0" directory and decompress it to obtain the dmlc-core source directory "/opt/xgboost-1.1.0/dmlc-core-5df8305fe699d3b503d10c60a231ab0223142407". Then delete the "/opt/xgboost-1.1.0/dmlc-core" directory and rename "/opt/xgboost-1.1.0/dmlc-core-5df8305fe699d3b503d10c60a231ab0223142407" to "/opt/xgboost-1.1.0/dmlc-core".
cd /opt/xgboost-1.1.0
wget -O dmlc-core-5df8305fe699d3b503d10c60a231ab0223142407.zip https://github.com/dmlc/dmlc-core/archive/5df8305fe699d3b503d10c60a231ab0223142407.zip
unzip dmlc-core-5df8305fe699d3b503d10c60a231ab0223142407.zip
rm -rf dmlc-core
mv dmlc-core-5df8305fe699d3b503d10c60a231ab0223142407 dmlc-core
- Download the rabit source package to the "/opt/xgboost-1.1.0" directory and decompress it to obtain the rabit source directory "/opt/xgboost-1.1.0/rabit-4fb34a008db6437c84d1877635064e09a55c8553". Then delete the "/opt/xgboost-1.1.0/rabit" directory and rename "/opt/xgboost-1.1.0/rabit-4fb34a008db6437c84d1877635064e09a55c8553" to "/opt/xgboost-1.1.0/rabit".
cd /opt/xgboost-1.1.0
wget -O rabit-4fb34a008db6437c84d1877635064e09a55c8553.zip https://github.com/dmlc/rabit/archive/4fb34a008db6437c84d1877635064e09a55c8553.zip
unzip rabit-4fb34a008db6437c84d1877635064e09a55c8553.zip
rm -rf rabit
mv rabit-4fb34a008db6437c84d1877635064e09a55c8553 rabit
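The cub, dmlc-core, and rabit steps above repeat one pattern: delete the existing directory, then rename the freshly extracted one to take its place. A minimal sketch of that pattern as a shell function follows. The name `replace_dir` is a hypothetical helper introduced here for illustration; it is not part of XGBoost or of the original procedure. It refuses to delete the target when the extracted directory is missing, so a failed download or unzip does not leave the tree without the directory.

```shell
# replace_dir is a hypothetical helper (an assumption of this sketch).
# It replaces directory <target> with the already extracted directory <extracted>.
replace_dir() {
    extracted="$1"   # for example: cub-b20808b1b04ec3d6a625e51fbc1eb76f337754ad
    target="$2"      # for example: cub
    # Refuse to delete the target unless the replacement actually exists.
    if [ ! -d "$extracted" ]; then
        echo "replace_dir: '$extracted' not found; '$target' left untouched" >&2
        return 1
    fi
    rm -rf "$target"
    mv "$extracted" "$target"
}

# Example usage inside /opt/xgboost-1.1.0, after the three archives are unzipped:
#   replace_dir cub-b20808b1b04ec3d6a625e51fbc1eb76f337754ad cub
#   replace_dir dmlc-core-5df8305fe699d3b503d10c60a231ab0223142407 dmlc-core
#   replace_dir rabit-4fb34a008db6437c84d1877635064e09a55c8553 rabit
```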
- Download the netlib source package to the "/opt/" directory and decompress it to obtain the netlib-2.2.1 source directory.
Download URL: https://github.com/luhenry/netlib/archive/refs/tags/v2.2.1.zip
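The netlib step gives only a download URL. A possible command sequence, modeled on the other download steps, is sketched below. It assumes `wget` and `unzip` are installed. The function name `fetch_netlib` and the local archive name `netlib-2.2.1.zip` are choices made for this sketch, not names from the original procedure. The function body runs in a subshell so the caller's working directory stays unchanged.

```shell
# URL taken from the step above; the local zip name is an assumption of this sketch.
NETLIB_URL="https://github.com/luhenry/netlib/archive/refs/tags/v2.2.1.zip"
NETLIB_ZIP="netlib-2.2.1.zip"

# fetch_netlib is a hypothetical helper. The parentheses make the body a subshell,
# so the "cd" does not affect the calling shell.
fetch_netlib() (
    cd /opt/ || exit 1
    wget -O "$NETLIB_ZIP" "$NETLIB_URL" || exit 1
    unzip "$NETLIB_ZIP"
)

# Nothing is downloaded until the function is called explicitly:
#   fetch_netlib
```

After a successful run, the source tree is expected under "/opt/netlib-2.2.1", which is the directory name used in the rest of this procedure.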
- In the "/opt/" directory, create the Spark-ml-algo-lib project with the directory hierarchy shown below.
cd /opt/
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/breeze/optimize
mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/breeze/numerics
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/classification
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/feature
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/optim/aggregator
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/optim/loss
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/recommendation
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/regression
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/stat
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/impl
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/clustering
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/feature
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/fpm
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/linalg/distributed
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/stat/correlation
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/tree
mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/impl
mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/clustering
mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/fpm
mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/tree/impurity
mkdir -p Spark-ml-algo-lib/ml-core/src/main/java/dev/ludovic/netlib/blas
cp -r xgboost-1.1.0 Spark-ml-algo-lib/ml-xgboost
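- Copy the corresponding source files from Spark 2.3.2, Breeze 0.13.1, and netlib 2.2.1 into the Spark-ml-algo-lib directory according to Table 1, Table 2, and Table 3. In each table, the two left columns give the target directory and file name, and the two right columns give the directory and file name of the source file to copy. Delete the unneeded parts of the native XGBoost code according to the list of files and directories to delete from the native XGBoost code, then copy the remaining code to the "Spark-ml-algo-lib/ml-xgboost" directory. Rename the directories listed in Table 4, where the first column is the current directory name and the second column is the name after renaming. Because many files need to be copied, only two example commands are given.
Some files must be renamed after they are copied to the target directory.
Example commands:
cp /opt/spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala /opt/Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala
cp /opt/breeze-releases-v0.13.1/math/src/main/scala/breeze/optimize/FirstOrderMinimizer.scala /opt/Spark-ml-algo-lib/ml-accelerator/src/main/scala/breeze/optimize/FirstOrderMinimizerX.scala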
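Because the tables list many files, the copy-and-rename step can be driven from a list instead of typing one `cp` per file. The following is a minimal sketch under stated assumptions: `copy_mapped` is a hypothetical helper name, the input format (one "source destination" pair per line, separated by whitespace) is a convention of this sketch, and paths are assumed to contain no spaces. The two pairs in the usage comment are the same files as in the example commands above.

```shell
# copy_mapped is a hypothetical helper (an assumption of this sketch).
# It reads "source destination" pairs from standard input, one pair per line,
# and copies each source file to its destination, creating directories as needed.
copy_mapped() {
    src_root="$1"   # for example: /opt
    dst_root="$2"   # for example: /opt/Spark-ml-algo-lib
    while read -r src dst; do
        # Skip blank lines and comment lines.
        case "$src" in ''|'#'*) continue ;; esac
        mkdir -p "$(dirname "$dst_root/$dst")"
        cp "$src_root/$src" "$dst_root/$dst" || return 1
    done
}

# Example usage:
#   copy_mapped /opt /opt/Spark-ml-algo-lib <<'EOF'
#   spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala ml-accelerator/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala
#   breeze-releases-v0.13.1/math/src/main/scala/breeze/optimize/FirstOrderMinimizer.scala ml-accelerator/src/main/scala/breeze/optimize/FirstOrderMinimizerX.scala
#   EOF
```

Keeping the pairs in a text file also leaves a record of which source file each renamed target came from.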
Table 1 Spark files to place in the Spark-ml-algo-lib project
An empty directory cell means the directory is the same as in the row above.

| Spark-ml-algo-lib project directory | Spark-ml-algo-lib project file name | Spark source directory | Spark source file name |
| --- | --- | --- | --- |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/classification/ | GBTClassifier.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/classification/ | GBTClassifier.scala |
| | LinearSVC.scala | | LinearSVC.scala |
| | RandomForestClassifier.scala | | RandomForestClassifier.scala |
| | DecisionTreeClassifier.scala | | DecisionTreeClassifier.scala |
| | LogisticRegression.scala | | LogisticRegression.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/feature/ | IDF.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/feature/ | IDF.scala |
| | Word2Vec.scala | | Word2Vec.scala |
| | DecisionTreeBucketizer.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/classification/ | RandomForestClassifier.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/optim/aggregator/ | DifferentiableLossAggregatorX.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/ | DifferentiableLossAggregator.scala |
| | HingeAggregatorX.scala | | HingeAggregator.scala |
| | HuberAggregatorX.scala | | HuberAggregator.scala |
| | LeastSquaresAggregatorX.scala | | LeastSquaresAggregator.scala |
| | LogisticAggregatorX.scala | | LogisticAggregator.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/optim/loss/ | RDDLossFunctionX.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/optim/loss/ | RDDLossFunction.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/recommendation/ | ALS.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/recommendation/ | ALS.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/regression/ | DecisionTreeRegressor.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/regression/ | DecisionTreeRegressor.scala |
| | GBTRegressor.scala | | GBTRegressor.scala |
| | LinearRegression.scala | | LinearRegression.scala |
| | RandomForestRegressor.scala | | RandomForestRegressor.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/stat/ | Correlation.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/stat/ | Correlation.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/impl/ | GradientBoostedTrees.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | GradientBoostedTrees.scala |
| | NodeIdCache.scala | | NodeIdCache.scala |
| | RandomForest.scala | | RandomForest.scala |
| | RandomForest4GBDTX.scala | | RandomForest.scala |
| | RandomForestRaw.scala | | RandomForest.scala |
| | DecisionForest.scala | | RandomForest.scala |
| | DecisionTreeBucket.scala | | RandomForest.scala |
| | DecisionTreeMetadata.scala | | DecisionTreeMetadata.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/ | treeParams.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/ | treeParams.scala |
| | treeModels.scala | | treeModels.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/clustering/ | KMACCm.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/clustering/ | KMeans.scala |
| | KMeans.scala | | KMeans.scala |
| | LDA.scala | | LDA.scala |
| | LDAOptimizer.scala | | LDAOptimizer.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/feature/ | IDF.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/feature/ | IDF.scala |
| | Word2Vec.scala | | Word2Vec.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/fpm/ | PrefixSpan.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/fpm/ | PrefixSpan.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/linalg/distributed/ | RowMatrix.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/ | RowMatrix.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/linalg/ | EigenValueDecomposition.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/linalg/ | EigenValueDecomposition.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/stat/correlation/ | Correlation.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/ | Correlation.scala |
| | PearsonCorrelation.scala | | PearsonCorrelation.scala |
| | SpearmanCorrelation.scala | | SpearmanCorrelation.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/tree/ | DecisionTree.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/tree/ | DecisionTree.scala |
| Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/ | Node.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/ | Node.scala |
| | Split.scala | | Split.scala |
| Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/impl/ | BaggedPoint.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | BaggedPoint.scala |
| | DTFeatureStatsAggregator.scala | | DTStatsAggregator.scala |
| | DTStatsAggregator.scala | | DTStatsAggregator.scala |
| | GradientBoostedTreesCore.scala | | RandomForest.scala |
| | TreePointX.scala | | TreePoint.scala |
| | TreePointY.scala | | TreePoint.scala |
| Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/clustering/ | LDAUtilsX.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/clustering/ | LDAUtils.scala |
| | OnlineLDAOptimizerXObj.scala | | LDAOptimizer.scala |
| Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/fpm/ | LocalPrefixSpan.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/fpm/ | LocalPrefixSpan.scala |
| | PrefixSpanBase.scala | | PrefixSpan.scala |
| Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/tree/impurity/ | Entropy.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/ | Entropy.scala |
| | Gini.scala | | Gini.scala |
| | Impurities.scala | | Impurities.scala |
| | Impurity.scala | | Impurity.scala |
| | Variance.scala | | Variance.scala |
Table 2 Breeze files to place in the Spark-ml-algo-lib project
An empty directory cell means the directory is the same as in the row above.

| Spark-ml-algo-lib project directory | Spark-ml-algo-lib project file name | Breeze source directory | Breeze source file name |
| --- | --- | --- | --- |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/breeze/optimize/ | FirstOrderMinimizerX.scala | breeze-releases-v0.13.1/math/src/main/scala/breeze/optimize/ | FirstOrderMinimizer.scala |
| | LBFGSX.scala | | LBFGS.scala |
| | OWLQNX.scala | | OWLQN.scala |
| Spark-ml-algo-lib/ml-core/src/main/scala/breeze/numerics/ | DigammaX.scala | breeze-releases-v0.13.1/math/src/main/scala/breeze/numerics/ | package.scala |
Table 3 netlib files to place in the Spark-ml-algo-lib project
An empty directory cell means the directory is the same as in the row above.

| Spark-ml-algo-lib project directory | Spark-ml-algo-lib project file name | netlib source directory | netlib source file name |
| --- | --- | --- | --- |
| Spark-ml-algo-lib/ml-core/src/main/java/dev/ludovic/netlib/ | BLAS.java | netlib-2.2.1/blas/src/main/java/dev/ludovic/netlib/ | BLAS.java |
| | InstanceBuilder.java | | InstanceBuilder.java |
| | JavaBLAS.java | | JavaBLAS.java |
| | NativeBLAS.java | | NativeBLAS.java |
| Spark-ml-algo-lib/ml-core/src/main/java/dev/ludovic/netlib/blas/ | AbstractBLAS.java | netlib-2.2.1/blas/src/main/java/dev/ludovic/netlib/blas/ | AbstractBLAS.java |
| | F2jBLAS.java | | F2jBLAS.java |
| | JNIBLAS.java | | JNIBLAS.java |
| | Java8BLAS.java | | Java8BLAS.java |
Table 4 Directories to rename

| Current Spark-ml-algo-lib project directory | Directory name after renaming |
| --- | --- |
| Spark-ml-algo-lib/ml-xgboost/jvm-packages/xgboost4j | Spark-ml-algo-lib/ml-xgboost/jvm-packages/boostkit-xgboost4j |
| Spark-ml-algo-lib/ml-xgboost/jvm-packages/xgboost4j-example | Spark-ml-algo-lib/ml-xgboost/jvm-packages/boostkit-xgboost4j-example |
| Spark-ml-algo-lib/ml-xgboost/jvm-packages/xgboost4j-flink | Spark-ml-algo-lib/ml-xgboost/jvm-packages/boostkit-xgboost4j-flink |
| Spark-ml-algo-lib/ml-xgboost/jvm-packages/xgboost4j-spark | Spark-ml-algo-lib/ml-xgboost/jvm-packages/boostkit-xgboost4j-spark |
| Spark-ml-algo-lib/ml-xgboost/jvm-packages/xgboost4j-tester | Spark-ml-algo-lib/ml-xgboost/jvm-packages/boostkit-xgboost4j-tester |
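Every rename in Table 4 adds the prefix "boostkit-" to a directory under "jvm-packages", so the five renames can be expressed as one loop. The sketch below assumes the five directories sit directly under the path passed in. `rename_xgboost4j_dirs` is a hypothetical helper name introduced for illustration.

```shell
# rename_xgboost4j_dirs is a hypothetical helper (an assumption of this sketch).
# It applies the Table 4 renames by prefixing each directory name with "boostkit-".
rename_xgboost4j_dirs() {
    jvm_dir="$1"   # for example: /opt/Spark-ml-algo-lib/ml-xgboost/jvm-packages
    for name in xgboost4j xgboost4j-example xgboost4j-flink xgboost4j-spark xgboost4j-tester; do
        if [ -d "$jvm_dir/$name" ]; then
            mv "$jvm_dir/$name" "$jvm_dir/boostkit-$name"
        else
            # Report directories that are missing instead of failing silently.
            echo "rename_xgboost4j_dirs: '$name' not found under '$jvm_dir'" >&2
        fi
    done
}

# Example usage:
#   rename_xgboost4j_dirs /opt/Spark-ml-algo-lib/ml-xgboost/jvm-packages
```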
The files and directories to delete from the native XGBoost code are as follows.
- xgboost-1.1.0/.github
- xgboost-1.1.0/cub/.settings
- xgboost-1.1.0/cub/.project
- xgboost-1.1.0/dmlc-core/.github
- xgboost-1.1.0/dmlc-core/make/config.mk
- xgboost-1.1.0/dmlc-core/test/unittest/sample.rec
- xgboost-1.1.0/doc/_static
- xgboost-1.1.0/rabit/lib
- xgboost-1.1.0/R-package/data
- xgboost-1.1.0/.gitignore
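The deletions listed above can be scripted so that none of the ten entries is missed. This is a minimal sketch: `prune_xgboost` is a hypothetical helper name, and the paths in the loop are the list entries written relative to the XGBoost root. The function checks that the root directory exists before deleting anything, which avoids running `rm -rf` against an empty or mistyped path.

```shell
# prune_xgboost is a hypothetical helper (an assumption of this sketch).
# It deletes the files and directories from the list above, relative to <xgb_root>.
prune_xgboost() {
    xgb_root="$1"   # for example: /opt/xgboost-1.1.0
    # Guard against an empty or non-existent root before any rm -rf.
    if [ -z "$xgb_root" ] || [ ! -d "$xgb_root" ]; then
        echo "prune_xgboost: '$xgb_root' is not a directory" >&2
        return 1
    fi
    for rel in .github cub/.settings cub/.project dmlc-core/.github \
               dmlc-core/make/config.mk dmlc-core/test/unittest/sample.rec \
               doc/_static rabit/lib R-package/data .gitignore; do
        rm -rf "$xgb_root/$rel"
    done
}

# Example usage:
#   prune_xgboost /opt/xgboost-1.1.0
```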
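- Download the patch to the "/opt/Spark-ml-algo-lib/" directory. Taking Spark 2.3.2 as an example, merge the Spark 2.3.2 patch into Spark-ml-algo-lib to obtain the complete adaptation code Spark-ml-algo-lib for the machine learning algorithm acceleration library.
cd /opt/Spark-ml-algo-lib
wget https://github.com/kunpengcompute/Spark-ml-algo-lib/releases/download/v2.2.0-spark2.3.2/Spark-ml-algo-lib-Spark2.3.2.patch
patch -p1 < Spark-ml-algo-lib-Spark2.3.2.patch
The directory structure of the complete adaptation code Spark-ml-algo-lib matches the code in the repository.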
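If any file was copied to the wrong place or under the wrong name, the patch can fail partway and leave the tree half modified. One way to reduce that risk is to run the patch in dry-run mode first and apply it only if every hunk would succeed. The sketch below assumes a `patch` implementation that supports `--dry-run` (GNU patch does). `apply_patch_checked` is a hypothetical helper name, not part of the original procedure.

```shell
# apply_patch_checked is a hypothetical helper (an assumption of this sketch).
# <patch_file> must be an absolute path or a path relative to <root>.
# The body runs in a subshell so the caller's working directory is unchanged.
apply_patch_checked() (
    root="$1"         # for example: /opt/Spark-ml-algo-lib
    patch_file="$2"   # for example: Spark-ml-algo-lib-Spark2.3.2.patch
    cd "$root" || exit 1
    # First pass: check that every hunk applies, without modifying any file.
    if ! patch -p1 --dry-run < "$patch_file" >/dev/null; then
        echo "apply_patch_checked: dry run failed; no files were modified" >&2
        exit 1
    fi
    # Second pass: apply the patch for real.
    patch -p1 < "$patch_file"
)

# Example usage:
#   apply_patch_checked /opt/Spark-ml-algo-lib Spark-ml-algo-lib-Spark2.3.2.patch
```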