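Building the Spark Machine Learning Algorithm Library Adaptation Code
To build Spark-ml-algo-lib, the adaptation code for the machine learning algorithm acceleration library, perform the following steps: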
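- Download the Spark 2.3.2 source zip package to the "/opt/" directory and decompress it to obtain the Spark source directory "/opt/spark-2.3.2".
Download address: https://github.com/apache/spark/archive/v2.3.2.zip
cd /opt/
wget https://github.com/apache/spark/archive/v2.3.2.zip
unzip v2.3.2.zip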
- Download the Breeze 0.13.1 source zip package to the "/opt/" directory and decompress it to obtain the Breeze source directory "/opt/breeze-releases-v0.13.1".
Download address: https://github.com/scalanlp/breeze/archive/releases/v0.13.1.zip
cd /opt/
wget https://github.com/scalanlp/breeze/archive/releases/v0.13.1.zip
unzip v0.13.1.zip
- In the "/opt/" directory, create the Spark-ml-algo-lib project with the directory hierarchy shown below.
cd /opt/
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/breeze/optimize
mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/breeze/numerics
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/classification
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/optim/aggregator
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/optim/loss
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/recommendation
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/regression
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/impl
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/clustering
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/fpm
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/linalg/distributed
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/tree
mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/impl
mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/clustering
mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/fpm
mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/tree/impurity
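The same directory skeleton can also be created from a loop rather than one command per directory. The following is a minimal sketch, assuming a POSIX shell; the function name `make_layout` and its root-directory argument are hypothetical and are not part of the Spark-ml-algo-lib project.

```shell
# Hypothetical helper: create the project skeleton under the root given as $1.
make_layout() {
    base="$1/Spark-ml-algo-lib"
    # Package directories for the ml-accelerator module (same list as the mkdir commands).
    for d in \
        breeze/optimize \
        org/apache/spark/ml/classification \
        org/apache/spark/ml/optim/aggregator \
        org/apache/spark/ml/optim/loss \
        org/apache/spark/ml/recommendation \
        org/apache/spark/ml/regression \
        org/apache/spark/ml/tree/impl \
        org/apache/spark/mllib/clustering \
        org/apache/spark/mllib/fpm \
        org/apache/spark/mllib/linalg/distributed \
        org/apache/spark/mllib/tree
    do
        mkdir -p "$base/ml-accelerator/src/main/scala/$d" || return 1
    done
    # Package directories for the ml-core module.
    for d in \
        breeze/numerics \
        org/apache/spark/ml/tree/impl \
        org/apache/spark/mllib/clustering \
        org/apache/spark/mllib/fpm \
        org/apache/spark/mllib/tree/impurity
    do
        mkdir -p "$base/ml-core/src/main/scala/$d" || return 1
    done
}

# Usage for this guide (not executed here): make_layout /opt
```

Because `mkdir -p` creates missing parent directories, intermediate directories such as ml/tree and mllib/linalg are created implicitly.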
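- Copy the source files from Spark 2.3.2 and Breeze 0.13.1 into the Spark-ml-algo-lib directory according to the mappings in Table 1 and Table 2. In each table, the two left columns give the destination directory and file name, and the two right columns give the directory and name of the source file to copy. Because many files must be copied, only two example commands are given.
Some files are renamed when they are copied to the destination folder; in that case the destination file name in the table differs from the source file name.
Example commands:
cp /opt/spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala /opt/Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala
cp /opt/breeze-releases-v0.13.1/math/src/main/scala/breeze/optimize/FirstOrderMinimizer.scala /opt/Spark-ml-algo-lib/ml-accelerator/src/main/scala/breeze/optimize/FirstOrderMinimizerX.scala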
Table 1 Spark files to be placed in the Spark-ml-algo-lib project (an empty directory cell repeats the directory from the row above)

| Spark-ml-algo-lib project directory | Spark-ml-algo-lib project file name | Spark source file directory | Spark source file name |
| --- | --- | --- | --- |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/classification/ | GBTClassifier.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/classification/ | GBTClassifier.scala |
| | LinearSVC.scala | | LinearSVC.scala |
| | RandomForestClassifier.scala | | RandomForestClassifier.scala |
| | DecisionTreeClassifier.scala | | DecisionTreeClassifier.scala |
| | LogisticRegression.scala | | LogisticRegression.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/optim/aggregator/ | DifferentiableLossAggregatorX.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/ | DifferentiableLossAggregator.scala |
| | HingeAggregatorX.scala | | HingeAggregator.scala |
| | HuberAggregatorX.scala | | HuberAggregator.scala |
| | LeastSquaresAggregatorX.scala | | LeastSquaresAggregator.scala |
| | LogisticAggregatorX.scala | | LogisticAggregator.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/optim/loss/ | RDDLossFunctionX.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/optim/loss/ | RDDLossFunction.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/regression/ | DecisionTreeRegressor.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/regression/ | DecisionTreeRegressor.scala |
| | GBTRegressor.scala | | GBTRegressor.scala |
| | LinearRegression.scala | | LinearRegression.scala |
| | RandomForestRegressor.scala | | RandomForestRegressor.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/impl/ | GradientBoostedTrees.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | GradientBoostedTrees.scala |
| | NodeIdCache.scala | | NodeIdCache.scala |
| | RandomForest.scala | | RandomForest.scala |
| | RandomForest4GBDTX.scala | | RandomForest.scala |
| | RandomForestRaw.scala | | RandomForest.scala |
| | DecisionForest.scala | | RandomForest.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/ | treeParams.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/ | treeParams.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/clustering/ | KMACCm.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/clustering/ | KMeans.scala |
| | KMeans.scala | | KMeans.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/linalg/distributed/ | RowMatrix.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/ | RowMatrix.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/linalg/ | EigenValueDecomposition.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/linalg/ | EigenValueDecomposition.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/tree/ | DecisionTree.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/tree/ | DecisionTree.scala |
| Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/ | Node.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/ | Node.scala |
| | Split.scala | | Split.scala |
| Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/impl/ | BaggedPoint.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | BaggedPoint.scala |
| | DTFeatureStatsAggregator.scala | | DTStatsAggregator.scala |
| | DTStatsAggregator.scala | | DTStatsAggregator.scala |
| | GradientBoostedTreesCore.scala | | RandomForest.scala |
| | TreePointX.scala | | TreePoint.scala |
| | TreePointY.scala | | TreePoint.scala |
| Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/tree/impurity/ | Entropy.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/ | Entropy.scala |
| | Gini.scala | | Gini.scala |
| | Impurities.scala | | Impurities.scala |
| | Impurity.scala | | Impurity.scala |
| | Variance.scala | | Variance.scala |
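Table 2 Breeze files to be placed in the Spark-ml-algo-lib project (an empty directory cell repeats the directory from the row above)

| Spark-ml-algo-lib project directory | Spark-ml-algo-lib project file name | Breeze source file directory | Breeze source file name |
| --- | --- | --- | --- |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/breeze/optimize/ | FirstOrderMinimizerX.scala | breeze-releases-v0.13.1/math/src/main/scala/breeze/optimize/ | FirstOrderMinimizer.scala |
| | LBFGSX.scala | | LBFGS.scala |
| | OWLQNX.scala | | OWLQN.scala |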
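Since the tables list more than forty copy operations, it may be convenient to drive `cp` from a mapping list instead of typing each command. The following is a minimal sketch, assuming a POSIX shell; the function name `copy_mapped` and the "source destination" line format are hypothetical conventions introduced here for illustration.

```shell
# Hypothetical helper: read "source destination" pairs (relative paths) from stdin
# and copy each file. $1 = root holding the source trees, $2 = root holding the project.
copy_mapped() {
    src_root="$1"
    dst_root="$2"
    while read -r src dst; do
        # Skip blank lines and comment lines in the mapping.
        case "$src" in ''|'#'*) continue ;; esac
        if [ ! -f "$src_root/$src" ]; then
            echo "missing source: $src_root/$src" >&2
            return 1
        fi
        # Create the destination directory if it does not exist yet.
        mkdir -p "$(dirname "$dst_root/$dst")" || return 1
        cp "$src_root/$src" "$dst_root/$dst" || return 1
    done
}

# Usage for this guide (not executed here); one line per table row, for example:
# copy_mapped /opt /opt <<'EOF'
# spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/TreePoint.scala Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/impl/TreePointX.scala
# spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/TreePoint.scala Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/impl/TreePointY.scala
# EOF
```

Listing the same source on several lines covers the rows where one source file is copied under more than one destination name. The helper stops at the first missing source file so that a mistyped path is noticed immediately.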
After the copy step (step 4) is complete, the directory structure of the Spark-ml-algo-lib project and the files in it are as follows:
Spark-ml-algo-lib
├── ml-accelerator
│   └── src
│       └── main
│           └── scala
│               ├── breeze
│               │   └── optimize
│               │       ├── FirstOrderMinimizerX.scala
│               │       ├── LBFGSX.scala
│               │       └── OWLQNX.scala
│               └── org
│                   └── apache
│                       └── spark
│                           ├── ml
│                           │   ├── classification
│                           │   │   ├── DecisionTreeClassifier.scala
│                           │   │   ├── GBTClassifier.scala
│                           │   │   ├── LinearSVC.scala
│                           │   │   ├── LogisticRegression.scala
│                           │   │   └── RandomForestClassifier.scala
│                           │   ├── optim
│                           │   │   ├── aggregator
│                           │   │   │   ├── DifferentiableLossAggregatorX.scala
│                           │   │   │   ├── HingeAggregatorX.scala
│                           │   │   │   ├── HuberAggregatorX.scala
│                           │   │   │   ├── LeastSquaresAggregatorX.scala
│                           │   │   │   └── LogisticAggregatorX.scala
│                           │   │   └── loss
│                           │   │       └── RDDLossFunctionX.scala
│                           │   ├── regression
│                           │   │   ├── DecisionTreeRegressor.scala
│                           │   │   ├── GBTRegressor.scala
│                           │   │   ├── LinearRegression.scala
│                           │   │   └── RandomForestRegressor.scala
│                           │   └── tree
│                           │       ├── impl
│                           │       │   ├── DecisionForest.scala
│                           │       │   ├── GradientBoostedTrees.scala
│                           │       │   ├── NodeIdCache.scala
│                           │       │   ├── RandomForest4GBDTX.scala
│                           │       │   ├── RandomForestRaw.scala
│                           │       │   └── RandomForest.scala
│                           │       └── treeParams.scala
│                           └── mllib
│                               ├── clustering
│                               │   ├── KMACCm.scala
│                               │   └── KMeans.scala
│                               ├── linalg
│                               │   ├── distributed
│                               │   │   └── RowMatrix.scala
│                               │   └── EigenValueDecomposition.scala
│                               └── tree
│                                   └── DecisionTree.scala
└── ml-core
    └── src
        └── main
            └── scala
                └── org
                    └── apache
                        └── spark
                            ├── ml
                            │   └── tree
                            │       ├── impl
                            │       │   ├── BaggedPoint.scala
                            │       │   ├── DTFeatureStatsAggregator.scala
                            │       │   ├── DTStatsAggregator.scala
                            │       │   ├── GradientBoostedTreesCore.scala
                            │       │   ├── TreePointX.scala
                            │       │   └── TreePointY.scala
                            │       ├── Node.scala
                            │       └── Split.scala
                            └── mllib
                                └── tree
                                    └── impurity
                                        ├── Entropy.scala
                                        ├── Gini.scala
                                        ├── Impurities.scala
                                        ├── Impurity.scala
                                        └── Variance.scala
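One quick way to check the result of the copy step is to count the .scala files in the project. The listing above contains 43 of them (30 under ml-accelerator and 13 under ml-core). The following is a minimal sketch, assuming a POSIX shell with `find`; the function name `count_scala` is a hypothetical helper.

```shell
# Hypothetical helper: print the number of .scala files below the directory given as $1.
count_scala() {
    # tr strips the padding that some wc implementations add around the number.
    find "$1" -type f -name '*.scala' | wc -l | tr -d ' '
}

# Usage for this guide (not executed here):
# [ "$(count_scala /opt/Spark-ml-algo-lib)" = "43" ] && echo "all files copied"
```

A matching count does not prove that every file has the right name, so comparing against the listing is still worthwhile when the count is off.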
- Download Spark-ml-algo-lib.patch to the "/opt/Spark-ml-algo-lib/" directory and apply it to the Spark-ml-algo-lib project to obtain the complete adaptation code for the machine learning algorithm acceleration library.
cd /opt/Spark-ml-algo-lib
wget https://github.com/kunpengcompute/Spark-ml-algo-lib/releases/download/v1.1.0/Spark-ml-algo-lib.patch
patch -p1 < Spark-ml-algo-lib.patch
The directory structure of the complete Spark-ml-algo-lib adaptation code and the files in it are as follows:
Spark-ml-algo-lib
├── LICENSE
├── ml-accelerator
│   ├── pom.xml
│   └── src
│       └── main
│           └── scala
│               ├── breeze
│               │   └── optimize
│               │       ├── FirstOrderMinimizerX.scala
│               │       ├── LBFGSX.scala
│               │       └── OWLQNX.scala
│               └── org
│                   └── apache
│                       └── spark
│                           ├── ml
│                           │   ├── classification
│                           │   │   ├── DecisionTreeClassifier.scala
│                           │   │   ├── GBTClassifier.scala
│                           │   │   ├── LinearSVC.scala
│                           │   │   ├── LogisticRegression.scala
│                           │   │   └── RandomForestClassifier.scala
│                           │   ├── optim
│                           │   │   ├── aggregator
│                           │   │   │   ├── DifferentiableLossAggregatorX.scala
│                           │   │   │   ├── HingeAggregatorX.scala
│                           │   │   │   ├── HuberAggregatorX.scala
│                           │   │   │   ├── LeastSquaresAggregatorX.scala
│                           │   │   │   └── LogisticAggregatorX.scala
│                           │   │   └── loss
│                           │   │       └── RDDLossFunctionX.scala
│                           │   ├── regression
│                           │   │   ├── DecisionTreeRegressor.scala
│                           │   │   ├── GBTRegressor.scala
│                           │   │   ├── LinearRegression.scala
│                           │   │   └── RandomForestRegressor.scala
│                           │   └── tree
│                           │       ├── impl
│                           │       │   ├── DecisionForest.scala
│                           │       │   ├── GradientBoostedTrees.scala
│                           │       │   ├── NodeIdCache.scala
│                           │       │   ├── RandomForest4GBDTX.scala
│                           │       │   ├── RandomForestRaw.scala
│                           │       │   └── RandomForest.scala
│                           │       └── treeParams.scala
│                           └── mllib
│                               ├── clustering
│                               │   ├── KMACCm.scala
│                               │   └── KMeans.scala
│                               ├── linalg
│                               │   ├── distributed
│                               │   │   └── RowMatrix.scala
│                               │   └── EigenValueDecomposition.scala
│                               └── tree
│                                   └── DecisionTree.scala
├── ml-core
│   ├── pom.xml
│   └── src
│       └── main
│           └── scala
│               └── org
│                   └── apache
│                       └── spark
│                           ├── ml
│                           │   └── tree
│                           │       ├── impl
│                           │       │   ├── BaggedPoint.scala
│                           │       │   ├── DTFeatureStatsAggregator.scala
│                           │       │   ├── DTStatsAggregator.scala
│                           │       │   ├── GradientBoostedTreesCore.scala
│                           │       │   ├── TreePointX.scala
│                           │       │   └── TreePointY.scala
│                           │       ├── Node.scala
│                           │       └── Split.scala
│                           └── mllib
│                               └── tree
│                                   └── impurity
│                                       ├── Entropy.scala
│                                       ├── Gini.scala
│                                       ├── Impurities.scala
│                                       ├── Impurity.scala
│                                       └── Variance.scala
├── ml-kernel-client
│   ├── pom.xml
│   └── src
│       └── main
│           └── scala
│               ├── breeze
│               │   ├── linalg
│               │   │   ├── blas
│               │   │   │   ├── Dgemv.scala
│               │   │   │   └── Gramian.scala
│               │   │   ├── DenseMatrixUtil.scala
│               │   │   ├── DenseVectorUtil.scala
│               │   │   └── lapack
│               │   │       └── EigenDecomposition.scala
│               │   └── optimize
│               │       ├── ACC.scala
│               │       ├── LBFGSL.scala
│               │       └── OWLQNL.scala
│               └── org
│                   └── apache
│                       └── spark
│                           ├── ml
│                           │   └── tree
│                           │       └── impl
│                           │           ├── DTUtils.scala
│                           │           ├── GradientBoostedTreesUtil.scala
│                           │           └── RFUtils.scala
│                           ├── mllib.clustering
│                           │   └── KmeansUtil.scala
│                           └── mllib.linalg.distributed
│                               └── RowMatrixUtil.scala
├── pom.xml
├── README.md
└── scalastyle-config.xml
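Compared with the state before the patch, the listing above adds the top-level files and a pom.xml in each of the three modules. Checking for those files is a simple way to confirm that the patch was applied. The following is a minimal sketch, assuming a POSIX shell; the function name `check_patched_layout` is a hypothetical helper, and the file list is taken from the listing above.

```shell
# Hypothetical helper: report which files added by the patch are missing
# under the project root given as $1. Returns 0 only if all are present.
check_patched_layout() {
    root="$1"
    missing=0
    for f in LICENSE README.md pom.xml scalastyle-config.xml \
             ml-accelerator/pom.xml ml-core/pom.xml ml-kernel-client/pom.xml
    do
        if [ ! -f "$root/$f" ]; then
            echo "missing: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage for this guide (not executed here):
# check_patched_layout /opt/Spark-ml-algo-lib && echo "patch applied"
```

The check covers only files that exist after the patch and not before it; it does not verify the contents of the patched Scala sources.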