Building the Machine Learning Algorithm Acceleration Library Adaptation Code

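  • The following procedure builds Spark-ml-algo-lib, the adaptation code for the machine learning algorithm acceleration library. It uses the adaptation code for Spark 2.3.2 as an example. The procedure for Spark 2.4.6 and Spark 3.1.1 is similar, so the same steps can be followed.
  • Perform the following operations in a Linux environment. This section is for reference only.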
  1. Download the Spark 2.3.2 source ZIP package to the "/opt/" directory and decompress it to obtain the Spark source directory.

    Download URL: https://github.com/apache/spark/archive/v2.3.2.zip

    wget https://github.com/apache/spark/archive/v2.3.2.zip
    
  2. Download the Breeze 0.13.1 source ZIP package to the "/opt/" directory and decompress it to obtain the Breeze source directory.

    Download URL: https://github.com/scalanlp/breeze/archive/releases/v0.13.1.zip

    wget https://github.com/scalanlp/breeze/archive/releases/v0.13.1.zip
    
  3. Download the XGBoost 1.1.0 source package to the "/opt/" directory and decompress it to obtain the XGBoost source directory.

    Download URL: https://github.com/dmlc/xgboost/archive/refs/tags/v1.1.0.zip

    wget https://github.com/dmlc/xgboost/archive/refs/tags/v1.1.0.zip
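
The three download steps above share one pattern: fetch a ZIP archive into "/opt/" and unpack it there. A minimal sketch of that pattern follows. The `WORK_DIR` and `DRY_RUN` variables and the `run` and `fetch_and_unpack` helpers are illustrative names that do not come from the original procedure. With the default `DRY_RUN=1`, the script only prints the commands it would run.

```shell
#!/usr/bin/env bash
set -eu

WORK_DIR="${WORK_DIR:-/opt}"   # directory used in steps 1 to 3
DRY_RUN="${DRY_RUN:-1}"        # assumption: 1 prints the commands, 0 executes them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Download one archive into WORK_DIR and unpack it there.
fetch_and_unpack() {
  local url="$1"
  local archive
  archive="$WORK_DIR/$(basename "$url")"
  run wget -O "$archive" "$url"
  run unzip -q -o "$archive" -d "$WORK_DIR"
}

fetch_and_unpack https://github.com/apache/spark/archive/v2.3.2.zip
fetch_and_unpack https://github.com/scalanlp/breeze/archive/releases/v0.13.1.zip
fetch_and_unpack https://github.com/dmlc/xgboost/archive/refs/tags/v1.1.0.zip
```

Running it with `DRY_RUN=0` performs the downloads. The unpacked directories should then be the ones this guide refers to later: spark-2.3.2, breeze-releases-v0.13.1, and xgboost-1.1.0.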
    
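  4. Download the cub source package to the "/opt/xgboost-1.1.0" directory and decompress it to obtain the cub source directory "/opt/xgboost-1.1.0/cub-b20808b1b04ec3d6a625e51fbc1eb76f337754ad". Then delete the "/opt/xgboost-1.1.0/cub" directory and rename "/opt/xgboost-1.1.0/cub-b20808b1b04ec3d6a625e51fbc1eb76f337754ad" to "/opt/xgboost-1.1.0/cub".

    cd /opt/xgboost-1.1.0
    # Save the archive under the name that unzip expects; by default wget names the file after the commit hash only.
    wget -O cub-b20808b1b04ec3d6a625e51fbc1eb76f337754ad.zip https://github.com/NVlabs/cub/archive/b20808b1b04ec3d6a625e51fbc1eb76f337754ad.zip
    unzip cub-b20808b1b04ec3d6a625e51fbc1eb76f337754ad.zip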
    rm -rf cub
    mv cub-b20808b1b04ec3d6a625e51fbc1eb76f337754ad cub
    
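  5. Download the dmlc-core source package to the "/opt/xgboost-1.1.0" directory and decompress it to obtain the dmlc-core source directory "/opt/xgboost-1.1.0/dmlc-core-5df8305fe699d3b503d10c60a231ab0223142407". Then delete the "/opt/xgboost-1.1.0/dmlc-core" directory and rename "/opt/xgboost-1.1.0/dmlc-core-5df8305fe699d3b503d10c60a231ab0223142407" to "/opt/xgboost-1.1.0/dmlc-core".

    cd /opt/xgboost-1.1.0
    # Save the archive under the name that unzip expects; by default wget names the file after the commit hash only.
    wget -O dmlc-core-5df8305fe699d3b503d10c60a231ab0223142407.zip https://github.com/dmlc/dmlc-core/archive/5df8305fe699d3b503d10c60a231ab0223142407.zip
    unzip dmlc-core-5df8305fe699d3b503d10c60a231ab0223142407.zip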
    rm -rf dmlc-core
    mv dmlc-core-5df8305fe699d3b503d10c60a231ab0223142407 dmlc-core
    
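  6. Download the rabit source package to the "/opt/xgboost-1.1.0" directory and decompress it to obtain the rabit source directory "/opt/xgboost-1.1.0/rabit-4fb34a008db6437c84d1877635064e09a55c8553". Then delete the "/opt/xgboost-1.1.0/rabit" directory and rename "/opt/xgboost-1.1.0/rabit-4fb34a008db6437c84d1877635064e09a55c8553" to "/opt/xgboost-1.1.0/rabit".

    cd /opt/xgboost-1.1.0
    # Save the archive under the name that unzip expects; by default wget names the file after the commit hash only.
    wget -O rabit-4fb34a008db6437c84d1877635064e09a55c8553.zip https://github.com/dmlc/rabit/archive/4fb34a008db6437c84d1877635064e09a55c8553.zip
    unzip rabit-4fb34a008db6437c84d1877635064e09a55c8553.zip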
    rm -rf rabit
    mv rabit-4fb34a008db6437c84d1877635064e09a55c8553 rabit
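
Steps 4 to 6 repeat one pattern: fetch a pinned commit of a dependency, unpack it inside the XGBoost tree, and swap it in for the bundled directory. The sketch below captures that pattern in a single helper. The `XGB_DIR` and `DRY_RUN` variables and the `run` and `replace_component` functions are hypothetical names introduced for illustration. With the default `DRY_RUN=1`, the script only prints the commands.

```shell
#!/usr/bin/env bash
set -eu

XGB_DIR="${XGB_DIR:-/opt/xgboost-1.1.0}"   # XGBoost source directory from step 3
DRY_RUN="${DRY_RUN:-1}"                    # assumption: 1 prints the commands, 0 executes them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# replace_component NAME GITHUB_REPO COMMIT
# Replace XGB_DIR/NAME with the given commit of GITHUB_REPO.
replace_component() {
  local name="$1" repo="$2" commit="$3"
  local unpacked="$name-$commit"
  run wget -O "$XGB_DIR/$unpacked.zip" "https://github.com/$repo/archive/$commit.zip"
  run unzip -q "$XGB_DIR/$unpacked.zip" -d "$XGB_DIR"
  run rm -rf "$XGB_DIR/$name"
  run mv "$XGB_DIR/$unpacked" "$XGB_DIR/$name"
}

replace_component cub       NVlabs/cub     b20808b1b04ec3d6a625e51fbc1eb76f337754ad
replace_component dmlc-core dmlc/dmlc-core 5df8305fe699d3b503d10c60a231ab0223142407
replace_component rabit     dmlc/rabit     4fb34a008db6437c84d1877635064e09a55c8553
```

Keeping the commit hashes in one place makes it easier to check them against the versions that XGBoost 1.1.0 pins.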
    
  7. Download the netlib source package to the "/opt/" directory and decompress it to obtain the netlib-2.2.1 source directory.

    Download URL: https://github.com/luhenry/netlib/archive/refs/tags/v2.2.1.zip
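
Once the netlib archive has been downloaded and unpacked in the same way as the earlier packages, all four source trees should sit side by side under "/opt/". A small check like the following can confirm that before the project is assembled. The `missing_sources` function is a hypothetical helper; the directory names are the ones this guide refers to.

```shell
#!/usr/bin/env bash
set -eu

# Print every expected source directory that is absent under the given base directory.
missing_sources() {
  local base="$1" dir
  for dir in spark-2.3.2 breeze-releases-v0.13.1 xgboost-1.1.0 netlib-2.2.1; do
    [ -d "$base/$dir" ] || echo "$base/$dir"
  done
}

# When a base directory is passed as the first argument, report on it.
if [ "$#" -ge 1 ]; then
  missing="$(missing_sources "$1")"
  if [ -n "$missing" ]; then
    printf 'Missing source directories:\n%s\n' "$missing" >&2
    exit 1
  fi
  echo "All source directories found under $1"
fi
```

Saved under a name of your choice (for example a hypothetical check_sources.sh), it would be run as `bash check_sources.sh /opt`.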

  8. In the "/opt/" directory, create the Spark-ml-algo-lib project with the following directory hierarchy.

    cd /opt/
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/breeze/optimize
    mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/breeze/numerics
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/classification
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/feature
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/optim/aggregator
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/optim/loss
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/recommendation
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/regression
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/stat
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/impl
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/clustering
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/feature
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/fpm
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/linalg/distributed
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/stat/correlation
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/tree
    mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/impl
    mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/clustering
    mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/fpm
    mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/tree/impurity
    mkdir -p Spark-ml-algo-lib/ml-core/src/main/java/dev/ludovic/netlib/blas
    cp -r xgboost-1.1.0 Spark-ml-algo-lib/ml-xgboost
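
The mkdir sequence above can also be written as a loop over a list of paths, which is easier to review or extend. This is only a sketch: the `create_skeleton` function and its argument are illustrative, and the path list is copied from the commands above. The XGBoost copy is left out.

```shell
#!/usr/bin/env bash
set -eu

# create_skeleton PROJECT_ROOT
# Create the source directories of the ml-accelerator and ml-core modules.
create_skeleton() {
  local root="$1" path
  while read -r path; do
    mkdir -p "$root/$path"
  done <<'EOF'
ml-accelerator/src/main/scala/breeze/optimize
ml-accelerator/src/main/scala/org/apache/spark/ml/classification
ml-accelerator/src/main/scala/org/apache/spark/ml/feature
ml-accelerator/src/main/scala/org/apache/spark/ml/optim/aggregator
ml-accelerator/src/main/scala/org/apache/spark/ml/optim/loss
ml-accelerator/src/main/scala/org/apache/spark/ml/recommendation
ml-accelerator/src/main/scala/org/apache/spark/ml/regression
ml-accelerator/src/main/scala/org/apache/spark/ml/stat
ml-accelerator/src/main/scala/org/apache/spark/ml/tree/impl
ml-accelerator/src/main/scala/org/apache/spark/mllib/clustering
ml-accelerator/src/main/scala/org/apache/spark/mllib/feature
ml-accelerator/src/main/scala/org/apache/spark/mllib/fpm
ml-accelerator/src/main/scala/org/apache/spark/mllib/linalg/distributed
ml-accelerator/src/main/scala/org/apache/spark/mllib/stat/correlation
ml-accelerator/src/main/scala/org/apache/spark/mllib/tree
ml-core/src/main/scala/breeze/numerics
ml-core/src/main/scala/org/apache/spark/ml/tree/impl
ml-core/src/main/scala/org/apache/spark/mllib/clustering
ml-core/src/main/scala/org/apache/spark/mllib/fpm
ml-core/src/main/scala/org/apache/spark/mllib/tree/impurity
ml-core/src/main/java/dev/ludovic/netlib/blas
EOF
}

# When a project root is passed as the first argument, create the tree there.
if [ "$#" -ge 1 ]; then
  create_skeleton "$1"
fi
```

For this guide the argument would be /opt/Spark-ml-algo-lib.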
    
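  9. Copy the files from Spark 2.3.2, Breeze 0.13.1, and netlib 2.2.1 into the Spark-ml-algo-lib directory according to the mappings in Table 1, Table 2, and Table 3. In each table, the two left columns give the target directory and file name, and the two right columns give the directory and name of the original file to copy. Delete the parts of the native XGBoost code that are not needed, as listed at the end of this step, and copy the remaining code to the "Spark-ml-algo-lib/ml-xgboost" directory. Then rename the directories listed in Table 4, where the first column is the current directory name and the second column is the new name. Because many files need to be copied, only two example commands are given.

    Some files must be renamed when they are copied to the target directory.

    Example commands:

    cp /opt/spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala /opt/Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala
    cp /opt/breeze-releases-v0.13.1/math/src/main/scala/breeze/optimize/FirstOrderMinimizer.scala /opt/Spark-ml-algo-lib/ml-accelerator/src/main/scala/breeze/optimize/FirstOrderMinimizerX.scala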
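
Because the tables below list many files, one way to avoid typing each cp command by hand is to write the mappings into a small manifest and loop over it. The following is a sketch under stated assumptions: the manifest format (four `|`-separated fields per line, in the same column order as the tables) and the `copy_from_manifest` function are hypothetical and are not provided by the Spark-ml-algo-lib project.

```shell
#!/usr/bin/env bash
set -eu

# copy_from_manifest BASE_DIR MANIFEST
# Each manifest line has the form target_dir|target_file|source_dir|source_file,
# with both directories relative to BASE_DIR (for this guide, /opt).
copy_from_manifest() {
  local base="$1" manifest="$2"
  local target_dir target_file source_dir source_file
  # The "|| [ -n ... ]" part also handles a last line without a trailing newline.
  while IFS='|' read -r target_dir target_file source_dir source_file || [ -n "$target_dir" ]; do
    # Skip blank lines and lines that start with '#'.
    case "$target_dir" in
      ''|'#'*) continue ;;
    esac
    mkdir -p "$base/$target_dir"
    cp "$base/$source_dir/$source_file" "$base/$target_dir/$target_file"
  done < "$manifest"
}

# When called with a base directory and a manifest file, perform the copies.
if [ "$#" -eq 2 ]; then
  copy_from_manifest "$1" "$2"
fi
```

A manifest line for the second example command above would read `Spark-ml-algo-lib/ml-accelerator/src/main/scala/breeze/optimize|FirstOrderMinimizerX.scala|breeze-releases-v0.13.1/math/src/main/scala/breeze/optimize|FirstOrderMinimizer.scala`. Since several target files are derived from the same source file, copying rather than moving keeps the source trees intact.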
    
    Table 1 Spark files to be placed in the Spark-ml-algo-lib project (an empty directory cell means the same directory as in the row above)

    | Spark-ml-algo-lib project directory | Spark-ml-algo-lib file name | Spark source directory | Spark source file name |
    |---|---|---|---|
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/classification/ | GBTClassifier.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/classification/ | GBTClassifier.scala |
    | | LinearSVC.scala | | LinearSVC.scala |
    | | RandomForestClassifier.scala | | RandomForestClassifier.scala |
    | | DecisionTreeClassifier.scala | | DecisionTreeClassifier.scala |
    | | LogisticRegression.scala | | LogisticRegression.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/feature/ | IDF.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/feature/ | IDF.scala |
    | | Word2Vec.scala | | Word2Vec.scala |
    | | DecisionTreeBucketizer.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/classification/ | RandomForestClassifier.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/optim/aggregator/ | DifferentiableLossAggregatorX.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/ | DifferentiableLossAggregator.scala |
    | | HingeAggregatorX.scala | | HingeAggregator.scala |
    | | HuberAggregatorX.scala | | HuberAggregator.scala |
    | | LeastSquaresAggregatorX.scala | | LeastSquaresAggregator.scala |
    | | LogisticAggregatorX.scala | | LogisticAggregator.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/optim/loss/ | RDDLossFunctionX.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/optim/loss/ | RDDLossFunction.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/recommendation/ | ALS.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/recommendation/ | ALS.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/regression/ | DecisionTreeRegressor.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/regression/ | DecisionTreeRegressor.scala |
    | | GBTRegressor.scala | | GBTRegressor.scala |
    | | LinearRegression.scala | | LinearRegression.scala |
    | | RandomForestRegressor.scala | | RandomForestRegressor.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/stat/ | Correlation.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/stat/ | Correlation.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/impl/ | GradientBoostedTrees.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | GradientBoostedTrees.scala |
    | | NodeIdCache.scala | | NodeIdCache.scala |
    | | RandomForest.scala | | RandomForest.scala |
    | | RandomForest4GBDTX.scala | | RandomForest.scala |
    | | RandomForestRaw.scala | | RandomForest.scala |
    | | DecisionForest.scala | | RandomForest.scala |
    | | DecisionTreeBucket.scala | | RandomForest.scala |
    | | DecisionTreeMetadata.scala | | DecisionTreeMetadata.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/ | treeParams.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/ | treeParams.scala |
    | | treeModels.scala | | treeModels.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/clustering/ | KMACCm.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/clustering/ | KMeans.scala |
    | | KMeans.scala | | KMeans.scala |
    | | LDA.scala | | LDA.scala |
    | | LDAOptimizer.scala | | LDAOptimizer.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/feature/ | IDF.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/feature/ | IDF.scala |
    | | Word2Vec.scala | | Word2Vec.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/fpm/ | PrefixSpan.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/fpm/ | PrefixSpan.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/linalg/distributed/ | RowMatrix.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/ | RowMatrix.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/linalg/ | EigenValueDecomposition.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/linalg/ | EigenValueDecomposition.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/stat/correlation/ | Correlation.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/ | Correlation.scala |
    | | PearsonCorrelation.scala | | PearsonCorrelation.scala |
    | | SpearmanCorrelation.scala | | SpearmanCorrelation.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/tree/ | DecisionTree.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/tree/ | DecisionTree.scala |
    | Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/ | Node.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/ | Node.scala |
    | | Split.scala | | Split.scala |
    | Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/impl/ | BaggedPoint.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | BaggedPoint.scala |
    | | DTFeatureStatsAggregator.scala | | DTStatsAggregator.scala |
    | | DTStatsAggregator.scala | | DTStatsAggregator.scala |
    | | GradientBoostedTreesCore.scala | | RandomForest.scala |
    | | TreePointX.scala | | TreePoint.scala |
    | | TreePointY.scala | | TreePoint.scala |
    | Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/clustering/ | LDAUtilsX.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/clustering/ | LDAUtils.scala |
    | | OnlineLDAOptimizerXObj.scala | | LDAOptimizer.scala |
    | Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/fpm/ | LocalPrefixSpan.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/fpm/ | LocalPrefixSpan.scala |
    | | PrefixSpanBase.scala | | PrefixSpan.scala |
    | Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/tree/impurity/ | Entropy.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/ | Entropy.scala |
    | | Gini.scala | | Gini.scala |
    | | Impurities.scala | | Impurities.scala |
    | | Impurity.scala | | Impurity.scala |
    | | Variance.scala | | Variance.scala |

    Table 2 Breeze files to be placed in the Spark-ml-algo-lib project (an empty directory cell means the same directory as in the row above)

    | Spark-ml-algo-lib project directory | Spark-ml-algo-lib file name | Breeze source directory | Breeze source file name |
    |---|---|---|---|
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/breeze/optimize/ | FirstOrderMinimizerX.scala | breeze-releases-v0.13.1/math/src/main/scala/breeze/optimize/ | FirstOrderMinimizer.scala |
    | | LBFGSX.scala | | LBFGS.scala |
    | | OWLQNX.scala | | OWLQN.scala |
    | Spark-ml-algo-lib/ml-core/src/main/scala/breeze/numerics/ | DigammaX.scala | breeze-releases-v0.13.1/math/src/main/scala/breeze/numerics/ | package.scala |

    Table 3 netlib files to be placed in the Spark-ml-algo-lib project (an empty directory cell means the same directory as in the row above)

    | Spark-ml-algo-lib project directory | Spark-ml-algo-lib file name | netlib source directory | netlib source file name |
    |---|---|---|---|
    | Spark-ml-algo-lib/ml-core/src/main/java/dev/ludovic/netlib/ | BLAS.java | netlib-2.2.1/blas/src/main/java/dev/ludovic/netlib/ | BLAS.java |
    | | InstanceBuilder.java | | InstanceBuilder.java |
    | | JavaBLAS.java | | JavaBLAS.java |
    | | NativeBLAS.java | | NativeBLAS.java |
    | Spark-ml-algo-lib/ml-core/src/main/java/dev/ludovic/netlib/blas/ | AbstractBLAS.java | netlib-2.2.1/blas/src/main/java/dev/ludovic/netlib/blas/ | AbstractBLAS.java |
    | | F2jBLAS.java | | F2jBLAS.java |
    | | JNIBLAS.java | | JNIBLAS.java |
    | | Java8BLAS.java | | Java8BLAS.java |

    Table 4 Directories to be renamed

    | Spark-ml-algo-lib project directory | Directory name after renaming |
    |---|---|
    | Spark-ml-algo-lib/ml-xgboost/jvm-packages/xgboost4j | Spark-ml-algo-lib/ml-xgboost/jvm-packages/boostkit-xgboost4j |
    | Spark-ml-algo-lib/ml-xgboost/jvm-packages/xgboost4j-example | Spark-ml-algo-lib/ml-xgboost/jvm-packages/boostkit-xgboost4j-example |
    | Spark-ml-algo-lib/ml-xgboost/jvm-packages/xgboost4j-flink | Spark-ml-algo-lib/ml-xgboost/jvm-packages/boostkit-xgboost4j-flink |
    | Spark-ml-algo-lib/ml-xgboost/jvm-packages/xgboost4j-spark | Spark-ml-algo-lib/ml-xgboost/jvm-packages/boostkit-xgboost4j-spark |
    | Spark-ml-algo-lib/ml-xgboost/jvm-packages/xgboost4j-tester | Spark-ml-algo-lib/ml-xgboost/jvm-packages/boostkit-xgboost4j-tester |
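
The renames in Table 4 all add the same `boostkit-` prefix, so they can be applied with a short loop. A sketch follows, assuming the jvm-packages directory of the XGBoost copy is passed as the first argument. The `rename_jvm_packages` function name is illustrative.

```shell
#!/usr/bin/env bash
set -eu

# rename_jvm_packages JVM_PACKAGES_DIR
# Prefix each xgboost4j* module directory from Table 4 with "boostkit-".
rename_jvm_packages() {
  local dir="$1" module
  for module in xgboost4j xgboost4j-example xgboost4j-flink xgboost4j-spark xgboost4j-tester; do
    if [ -d "$dir/$module" ]; then
      mv "$dir/$module" "$dir/boostkit-$module"
    else
      echo "skipping $module: not found in $dir" >&2
    fi
  done
}

# When a directory is passed as the first argument, rename the modules in it.
if [ "$#" -eq 1 ]; then
  rename_jvm_packages "$1"
fi
```

For this guide the argument would be /opt/Spark-ml-algo-lib/ml-xgboost/jvm-packages. Modules that are already renamed or missing are reported and skipped, so the script can be rerun safely.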

    The following files and directories must be deleted from the native XGBoost code:

    • xgboost-1.1.0/.github
    • xgboost-1.1.0/cub/.settings
    • xgboost-1.1.0/cub/.project
    • xgboost-1.1.0/dmlc-core/.github
    • xgboost-1.1.0/dmlc-core/make/config.mk
    • xgboost-1.1.0/dmlc-core/test/unittest/sample.rec
    • xgboost-1.1.0/doc/_static
    • xgboost-1.1.0/rabit/lib
    • xgboost-1.1.0/R-package/data
    • xgboost-1.1.0/.gitignore
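
The deletions listed above can be scripted so that none is missed. The sketch below assumes that the XGBoost tree to clean is passed as the first argument, for example the unpacked "/opt/xgboost-1.1.0" directory. The `prune_xgboost` function is a hypothetical helper, and the paths are taken from the list above.

```shell
#!/usr/bin/env bash
set -eu

# prune_xgboost XGBOOST_ROOT
# Remove the files and directories that are not needed from the XGBoost tree.
prune_xgboost() {
  local root="$1" entry
  for entry in .github cub/.settings cub/.project dmlc-core/.github \
               dmlc-core/make/config.mk dmlc-core/test/unittest/sample.rec \
               doc/_static rabit/lib R-package/data .gitignore; do
    # ${root:?} aborts if root is empty, so rm never runs against "/".
    rm -rf "${root:?}/$entry"
  done
}

# When a directory is passed as the first argument, prune it.
if [ "$#" -eq 1 ]; then
  prune_xgboost "$1"
fi
```

Because `rm -rf` ignores paths that do not exist, the same function can also be pointed at an existing copy such as /opt/Spark-ml-algo-lib/ml-xgboost.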
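  10. Download the patch to the "/opt/Spark-ml-algo-lib/" directory. Using Spark 2.3.2 as an example, apply the Spark 2.3.2 patch to Spark-ml-algo-lib to obtain the complete adaptation code of the machine learning algorithm acceleration library.
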
    cd /opt/Spark-ml-algo-lib
    wget https://github.com/kunpengcompute/Spark-ml-algo-lib/releases/download/v2.2.0-spark2.3.2/Spark-ml-algo-lib-Spark2.3.2.patch
    patch -p1 < Spark-ml-algo-lib-Spark2.3.2.patch
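
If a file was copied to the wrong place or under the wrong name, the patch may apply only partially and leave the tree in a mixed state. One cautious variant is to test the patch first and apply it only if the test run succeeds. The sketch below assumes GNU patch, whose `--dry-run` and `-d` options are used here. The `apply_patch_checked` function is a hypothetical helper.

```shell
#!/usr/bin/env bash
set -eu

# apply_patch_checked PROJECT_DIR PATCH_FILE
# Apply the patch only if a dry run reports that every hunk applies.
apply_patch_checked() {
  local project_dir="$1" patch_file="$2"
  if ! patch -d "$project_dir" -p1 --dry-run < "$patch_file" > /dev/null; then
    echo "patch does not apply cleanly; nothing was changed" >&2
    return 1
  fi
  patch -d "$project_dir" -p1 < "$patch_file"
}

# When called with a project directory and a patch file, apply the patch.
if [ "$#" -eq 2 ]; then
  apply_patch_checked "$1" "$2"
fi
```

For this guide the arguments would be /opt/Spark-ml-algo-lib and /opt/Spark-ml-algo-lib/Spark-ml-algo-lib-Spark2.3.2.patch.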
    

    The directory structure of the complete Spark-ml-algo-lib adaptation code is consistent with the code in the repository.