Classification and Regression

Scenarios

Classification and regression analysis is a predictive modeling technique that explores the relationship between labels and features, where labels are the dependent variables and features are the independent variables. Classification and regression algorithms are typically applied to predictive analysis and modeling.

Specifically, Linear Regression and Logistic Regression are used for credit risk analysis of Internet finance P2P services and for traffic flow prediction on road networks; SVM is used for price forecasting in the international carbon financial market and for traffic flow prediction; and GBDT and XGBoost are used for debt risk rating and warning and for travel mode recommendation.

Regression algorithms train a model iteratively, converging toward the label values over multiple iterations. The Kunpeng BoostKit big data machine learning algorithm library optimizes these iterative algorithms and exploits the high-concurrency capabilities of Kunpeng processors to reduce the number of training iterations, improving algorithm performance severalfold.
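
The iterative convergence described above can be illustrated with batch gradient descent on a one-feature linear model. This is a minimal sketch in plain Scala, not the BoostKit or Spark implementation; GradientDescentSketch, fit, stepSize, and maxIter are names chosen for illustration only.

```scala
object GradientDescentSketch {
  /** Fits y ≈ w * x + b by batch gradient descent on the mean squared error. */
  def fit(xs: Array[Double], ys: Array[Double], stepSize: Double, maxIter: Int): (Double, Double) = {
    require(xs.nonEmpty && xs.length == ys.length, "xs and ys must be non-empty and equally long")
    val n = xs.length.toDouble
    var w = 0.0
    var b = 0.0
    for (_ <- 0 until maxIter) {
      // Error of the current model on every sample.
      val errors = xs.indices.map(i => w * xs(i) + b - ys(i))
      // Gradient of the mean squared error with respect to w and b.
      val gradW = 2.0 / n * xs.indices.map(i => errors(i) * xs(i)).sum
      val gradB = 2.0 / n * errors.sum
      // Move the parameters against the gradient.
      w -= stepSize * gradW
      b -= stepSize * gradB
    }
    (w, b)
  }

  def main(args: Array[String]): Unit = {
    // Illustrative data generated from y = 2x + 1.
    val xs = Array(0.0, 1.0, 2.0, 3.0)
    val ys = xs.map(x => 2.0 * x + 1.0)
    val (w, b) = fit(xs, ys, stepSize = 0.05, maxIter = 2000)
    println(f"w = $w%.4f, b = $b%.4f")
  }
}
```

Each pass over the data costs one loss and gradient evaluation, which is why reducing the number of iterations shortens training time.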

Principles

  • GBDT

    Gradient boosting decision tree (GBDT) is a popular decision tree–based ensemble algorithm used for classification and regression tasks. It iteratively trains decision trees to minimize a loss function. Spark GBDT enables binary classification and regression, supports continuous features and categorical features, and uses distributed computing for training and inference in big data scenarios.

  • Random Forest

    The Random Forest algorithm trains multiple decision trees simultaneously to obtain a classification model or regression model based on given sample data that includes feature vectors and label values. Given input feature vectors, the output model predicts the label value with the highest probability.

  • SVM

    Support vector machine (SVM) is a generalized linear classifier that performs binary classification on data in a supervised learning manner. Its decision boundary is the maximum-margin hyperplane computed from the training samples. SVM is a sparse and robust classifier that uses the hinge loss function to calculate empirical risk and adds a regularization term to the optimization problem to reduce structural risk. The LinearSVC algorithm of Spark introduces two optimization policies: reducing the number of calls to the f function (the distributed computation of the loss and gradient of the objective function) through algorithm principle optimization, and accelerating convergence by adding momentum to parameter updates.

  • Decision Tree

    The Decision Tree algorithm is widely used in fields such as machine learning and computer vision for classification and regression. It trains a binary tree to obtain a classification model or regression model based on given sample data that contains feature vectors and label values. Given input feature vectors, the output model predicts the label value with the highest probability.

  • Linear Regression

    Regression algorithms are supervised learning algorithms used to find possible relationships between the independent variable X and the observable variable Y. If the observable variable is continuous, it is called "regression". In machine learning, Linear Regression uses a linear model to model the relationship between the independent variable X and the observable variable Y. The unknown model parameters are estimated from the training data.

  • Logistic Regression

    Logistic Regression is a classification method that uses a linear model to model the relationship between the independent variable X and the observable variable Y. The unknown model parameters are estimated from the training data.

  • XGBoost

    XGBoost is a deeply optimized distributed gradient boosting library that is efficient, flexible, and portable. It implements machine learning algorithms in the gradient boosting framework and provides a parallel tree boosting algorithm that can quickly and accurately solve many data science problems.

  • KNN

    K-nearest neighbors (KNN) is a non-parametric algorithm in machine learning that is used to find k samples closest to a given sample. It can be used for classification, regression, and information retrieval.
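
The GBDT principle in the list above (iteratively training trees to minimize a loss function) can be sketched in plain Scala. This is a simplified illustration under stated assumptions, not Spark's or BoostKit's implementation: it uses squared-error loss, one feature, depth-1 trees (stumps), and scales every stump by stepSize. BoostingSketch, Stump, fitStump, and boost are hypothetical names.

```scala
object BoostingSketch {
  /** A depth-1 regression tree (stump): one threshold on a single feature. */
  final case class Stump(threshold: Double, left: Double, right: Double) {
    def predict(x: Double): Double = if (x <= threshold) left else right
  }

  private def mean(values: Array[Double]): Double =
    if (values.isEmpty) 0.0 else values.sum / values.length

  /** Picks the threshold with the smallest squared error against the targets. */
  def fitStump(xs: Array[Double], targets: Array[Double]): Stump = {
    val samples = xs.zip(targets)
    val scored = xs.distinct.map { t =>
      val (l, r) = samples.partition(_._1 <= t)
      val stump = Stump(t, mean(l.map(_._2)), mean(r.map(_._2)))
      val sse = samples.map { case (x, y) => math.pow(y - stump.predict(x), 2) }.sum
      (sse, stump)
    }
    scored.reduceLeft((a, b) => if (b._1 < a._1) b else a)._2
  }

  /** Each round fits a stump to the current residuals and adds it to the ensemble. */
  def boost(xs: Array[Double], ys: Array[Double], maxIter: Int, stepSize: Double): Double => Double = {
    require(xs.nonEmpty && xs.length == ys.length, "xs and ys must be non-empty and equally long")
    var stumps = Vector.empty[Stump]
    def predict(x: Double): Double = stumps.map(s => stepSize * s.predict(x)).sum
    for (_ <- 0 until maxIter) {
      // For squared-error loss, the residual is the negative gradient of the loss.
      val residuals = xs.zip(ys).map { case (x, y) => y - predict(x) }
      stumps = stumps :+ fitStump(xs, residuals)
    }
    (x: Double) => predict(x)
  }

  def main(args: Array[String]): Unit = {
    // Illustrative step-shaped data.
    val xs = Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
    val ys = Array(0.0, 0.0, 0.0, 1.0, 1.0, 1.0)
    val model = boost(xs, ys, maxIter = 20, stepSize = 0.5)
    println(s"f(2.0) = ${model(2.0)}, f(5.0) = ${model(5.0)}")
  }
}
```

A smaller stepSize makes each tree contribute less, so more iterations are needed to reach the same fit.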

Programming Example

This example shows how to program with the GBDT algorithm.

GBDT has two types of model interfaces: ML Classification API and ML Regression API.

| Model API Type | Function API |
| --- | --- |
| ML Classification API | def fit(dataset: Dataset[_]): GBTClassificationModel |
| ML Classification API | def fit(dataset: Dataset[_], paramMap: ParamMap): GBTClassificationModel |
| ML Classification API | def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[GBTClassificationModel] |
| ML Classification API | def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): GBTClassificationModel |
| ML Regression API | def fit(dataset: Dataset[_]): GBTRegressionModel |
| ML Regression API | def fit(dataset: Dataset[_], paramMap: ParamMap): GBTRegressionModel |
| ML Regression API | def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[GBTRegressionModel] |
| ML Regression API | def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): GBTRegressionModel |

Figure 1 shows the time sequence of the GBDT classification model.

Figure 1 Time sequence of the GBDT classification model

ML Classification API

  • Function description

    Takes sample data in Dataset format, calls the training API, and outputs the GBDT classification model.

  • Input and output
    1. Package name: package org.apache.spark.ml.classification
    2. Class name: GBTClassifier
    3. Method name: fit
    4. Input: training sample data (Dataset[_]). The following are mandatory fields.

      | Parameter | Value Type | Default Value | Description |
      | --- | --- | --- | --- |
      | labelCol | Double | label | Label column to be predicted |
      | featuresCol | Vector | features | Feature vector column |

    5. Input: model parameters of the fit API (paramMap, paramMaps, firstParamPair, and otherParamPairs), described as follows:

      | Parameter | Value Type | Example | Description |
      | --- | --- | --- | --- |
      | paramMap | ParamMap | ParamMap(A.c -> b) | Assigns the value of b to the parameter c of model A. |
      | paramMaps | Array[ParamMap] | Array[ParamMap](n) | Creates an array of n ParamMap parameter lists for the model. |
      | firstParamPair | ParamPair | ParamPair(A.c, b) | Assigns the value of b to the parameter c of model A. |
      | otherParamPairs | ParamPair | ParamPair(A.e, f) | Assigns the value of f to the parameter e of model A. |

    6. Parameters optimized based on native algorithms
      def setCheckpointInterval(value: Int): GBTClassifier.this.type
      def setFeatureSubsetStrategy(value: String): GBTClassifier.this.type
      def setFeaturesCol(value: String): GBTClassifier
      def setImpurity(value: String): GBTClassifier.this.type
      def setLabelCol(value: String): GBTClassifier
      def setLossType(value: String): GBTClassifier.this.type
      def setMaxBins(value: Int): GBTClassifier.this.type
      def setMaxDepth(value: Int): GBTClassifier.this.type
      def setMaxIter(value: Int): GBTClassifier.this.type
      def setMinInfoGain(value: Double): GBTClassifier.this.type
      def setMinInstancesPerNode(value: Int): GBTClassifier.this.type
      def setPredictionCol(value: String): GBTClassifier
      def setProbabilityCol(value: String): GBTClassifier
      def setRawPredictionCol(value: String): GBTClassifier
      def setSeed(value: Long): GBTClassifier.this.type
      def setStepSize(value: Double): GBTClassifier.this.type
      def setSubsamplingRate(value: Double): GBTClassifier.this.type
      def setThresholds(value: Array[Double]): GBTClassifier
    7. Newly added parameters

      | Parameter | Description | Value Type |
      | --- | --- | --- |
      | doUseAcc | Whether to enable the feature parallel training mode | Boolean (true or false) |

      An example is provided as follows:

      import org.apache.spark.ml.classification.{GBTClassificationModel, GBTClassifier}
      import org.apache.spark.ml.param.{ParamMap, ParamPair}

      // Illustrative values (assumptions). trainingData is assumed to be an existing
      // Dataset that contains the label and features columns.
      val maxDepth = 5
      val maxIter = 10
      val maxBins = 32

      val gbdt = new GBTClassifier() // Define the classification model.

      // Define the paramMap argument of fit(dataset: Dataset[_], paramMap: ParamMap).
      val paramMap = ParamMap(gbdt.maxDepth -> maxDepth)
        .put(gbdt.maxIter, maxIter)

      // Define the paramMaps argument of fit(dataset: Dataset[_], paramMaps: Array[ParamMap]).
      val paramMaps: Array[ParamMap] = new Array[ParamMap](2)
      for (i <- 0 until 2) { // Assign a value to each element; valid indices are 0 and 1.
        paramMaps(i) = ParamMap(gbdt.maxDepth -> maxDepth)
          .put(gbdt.maxIter, maxIter)
      }

      // Define the ParamPair arguments of
      // fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*).
      val maxDepthParamPair = ParamPair(gbdt.maxDepth, maxDepth)
      val maxIterParamPair = ParamPair(gbdt.maxIter, maxIter)
      val maxBinsParamPair = ParamPair(gbdt.maxBins, maxBins)

      // Call the fit APIs.
      val model1: GBTClassificationModel = gbdt.fit(trainingData)
      val model2: GBTClassificationModel = gbdt.fit(trainingData, paramMap)
      val models: Seq[GBTClassificationModel] = gbdt.fit(trainingData, paramMaps)
      val model3: GBTClassificationModel = gbdt.fit(trainingData, maxDepthParamPair, maxIterParamPair, maxBinsParamPair)
      
    8. Output: GBDT classification model (GBTClassificationModel). The following table lists the field output in model prediction.

      | Parameter | Value Type | Default Value | Description |
      | --- | --- | --- | --- |
      | predictionCol | Double | prediction | Predicted label |

  • Example
    fit(dataset: Dataset[_]): GBTClassificationModel example:
    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.{GBTClassificationModel, GBTClassifier}
    import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
    import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}

    // Load and parse the data file, converting it to a DataFrame.
    val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

    // Index labels, adding metadata to the label column.
    // Fit on whole dataset to include all labels in index.
    val labelIndexer = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("indexedLabel")
      .fit(data)
    // Automatically identify categorical features, and index them.
    // Set maxCategories so features with > 4 distinct values are treated as continuous.
    val featureIndexer = new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexedFeatures")
      .setMaxCategories(4)
      .fit(data)

    // Split the data into training and test sets (30% held out for testing).
    val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

    // Train a GBT model.
    val gbt = new GBTClassifier()
      .setLabelCol("indexedLabel")
      .setFeaturesCol("indexedFeatures")
      .setMaxIter(10)

    // Convert indexed labels back to original labels.
    val labelConverter = new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("predictedLabel")
      .setLabels(labelIndexer.labels)

    // Chain indexers and GBT in a Pipeline.
    val pipeline = new Pipeline()
      .setStages(Array(labelIndexer, featureIndexer, gbt, labelConverter))

    // Train model. This also runs the indexers.
    val model = pipeline.fit(trainingData)

    // Make predictions.
    val predictions = model.transform(testData)

    // Select (prediction, true label) and compute test error.
    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("indexedLabel")
      .setPredictionCol("prediction")
      .setMetricName("accuracy")
    val accuracy = evaluator.evaluate(predictions)
    println("Test Error = " + (1.0 - accuracy))

    // The GBT model is the third pipeline stage (index 2).
    val gbtModel = model.stages(2).asInstanceOf[GBTClassificationModel]
    println("Learned classification GBT model:\n" + gbtModel.toDebugString)
    
  • Result
    Test Error = 0.0714285714285714
    Learned classification GBT model:
    GBTClassificationModel (uid=gbtc_72086dba9af5) with 10 trees
      Tree 0 (weight 1.0):
        If (feature 406 <= 9.5)
         Predict: 1.0
        Else (feature 406 > 9.5)
         Predict: -1.0
      Tree 1 (weight 0.1):
        If (feature 406 <= 9.5)
         If (feature 209 <= 241.5)
          If (feature 154 <= 55.0)
           Predict: 0.4768116880884702
          Else (feature 154 > 55.0)
           Predict: 0.4768116880884703
         Else (feature 209 > 241.5)
          Predict: 0.47681168808847035
        Else (feature 406 > 9.5)
         If (feature 461 <= 143.5)
          Predict: -0.47681168808847024
         Else (feature 461 > 143.5)
          Predict: -0.47681168808847035
      Tree 2 (weight 0.1):
        If (feature 406 <= 9.5)
         If (feature 657 <= 116.5)
          If (feature 154 <= 9.5)
           Predict: 0.4381935810427206
          Else (feature 154 > 9.5)
           Predict: 0.43819358104272066
         Else (feature 657 > 116.5)
          Predict: 0.43819358104272066
        Else (feature 406 > 9.5)
         If (feature 322 <= 16.0)
          Predict: -0.4381935810427206
         Else (feature 322 > 16.0)
          Predict: -0.4381935810427206
      Tree 3 (weight 0.1):
        If (feature 406 <= 9.5)
         If (feature 598 <= 166.5)
          If (feature 180 <= 3.0)
           Predict: 0.4051496802845983
          Else (feature 180 > 3.0)
           Predict: 0.4051496802845984
         Else (feature 598 > 166.5)
          Predict: 0.4051496802845983
        Else (feature 406 > 9.5)
         Predict: -0.4051496802845983
      Tree 4 (weight 0.1):
        If (feature 406 <= 9.5)
         If (feature 537 <= 47.5)
          If (feature 606 <= 7.0)
           Predict: 0.3765841318352991
          Else (feature 606 > 7.0)
           Predict: 0.37658413183529926
         Else (feature 537 > 47.5)
          Predict: 0.3765841318352994
        Else (feature 406 > 9.5)
         If (feature 124 <= 35.5)
          If (feature 376 <= 1.0)
           If (feature 516 <= 26.5)
            If (feature 266 <= 50.5)
             Predict: -0.3765841318352991
            Else (feature 266 > 50.5)
             Predict: -0.37658413183529915
           Else (feature 516 > 26.5)
            Predict: -0.3765841318352992
          Else (feature 376 > 1.0)
           Predict: -0.3765841318352994
         Else (feature 124 > 35.5)
          Predict: -0.3765841318352994
      Tree 5 (weight 0.1):
        If (feature 406 <= 9.5)
         If (feature 570 <= 3.5)
          Predict: 0.35166478958101005
         Else (feature 570 > 3.5)
          Predict: 0.35166478958101
        Else (feature 406 > 9.5)
         If (feature 266 <= 14.0)
          If (feature 267 <= 12.5)
           Predict: -0.35166478958101005
          Else (feature 267 > 12.5)
           If (feature 267 <= 36.0)
            Predict: -0.35166478958101005
           Else (feature 267 > 36.0)
            Predict: -0.3516647895810101
         Else (feature 266 > 14.0)
          Predict: -0.35166478958101005
      Tree 6 (weight 0.1):
        If (feature 406 <= 9.5)
         If (feature 207 <= 7.5)
          Predict: 0.32974984655529926
         Else (feature 207 > 7.5)
          Predict: 0.3297498465552993
        Else (feature 406 > 9.5)
         If (feature 490 <= 185.0)
          Predict: -0.32974984655529926
         Else (feature 490 > 185.0)
          Predict: -0.3297498465552993
      Tree 7 (weight 0.1):
        If (feature 406 <= 9.5)
         If (feature 568 <= 22.0)
          Predict: 0.3103372455197956
         Else (feature 568 > 22.0)
          Predict: 0.31033724551979563
        Else (feature 406 > 9.5)
         If (feature 379 <= 133.5)
          If (feature 237 <= 250.5)
           Predict: -0.3103372455197956
          Else (feature 237 > 250.5)
           Predict: -0.3103372455197957
         Else (feature 379 > 133.5)
          If (feature 433 <= 183.5)
           If (feature 516 <= 9.0)
            Predict: -0.3103372455197956
           Else (feature 516 > 9.0)
            Predict: -0.3103372455197957
          Else (feature 433 > 183.5)
           Predict: -0.3103372455197957
      Tree 8 (weight 0.1):
        If (feature 406 <= 9.5)
         If (feature 184 <= 19.0)
          Predict: 0.2930291649125433
         Else (feature 184 > 19.0)
          If (feature 155 <= 147.0)
           If (feature 180 <= 3.0)
            Predict: 0.2930291649125433
           Else (feature 180 > 3.0)
            Predict: 0.2930291649125433
          Else (feature 155 > 147.0)
           Predict: 0.2930291649125434
        Else (feature 406 > 9.5)
         If (feature 379 <= 133.5)
          Predict: -0.2930291649125433
         Else (feature 379 > 133.5)
          If (feature 433 <= 52.5)
           Predict: -0.2930291649125433
          Else (feature 433 > 52.5)
           If (feature 462 <= 143.5)
            Predict: -0.2930291649125433
           Else (feature 462 > 143.5)
            Predict: -0.2930291649125434
      Tree 9 (weight 0.1):
        If (feature 406 <= 9.5)
         If (feature 183 <= 3.0)
          Predict: 0.27750666438358246
         Else (feature 183 > 3.0)
          If (feature 183 <= 19.5)
           Predict: 0.27750666438358246
          Else (feature 183 > 19.5)
           Predict: 0.2775066643835825
        Else (feature 406 > 9.5)
         If (feature 239 <= 50.5)
          If (feature 435 <= 102.0)
           Predict: -0.27750666438358246
          Else (feature 435 > 102.0)
           Predict: -0.2775066643835825
         Else (feature 239 > 50.5)
          Predict: -0.27750666438358257
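
The listing above can be read as a weighted sum of tree outputs. The following is a minimal sketch in plain Scala, assuming (as the per-tree weights in the output suggest) that the model multiplies each tree's leaf value by the tree weight, sums the results, and predicts class 1.0 when the sum is positive. GbtMarginSketch, margin, and predictLabel are hypothetical names, and the leaf values are rounded from the listing for a sample that takes the "feature 406 <= 9.5" branch in every tree.

```scala
object GbtMarginSketch {
  // (tree weight, leaf value) pairs for Tree 0 to Tree 9, rounded to four decimals.
  val weightedLeaves: Seq[(Double, Double)] = Seq(
    (1.0, 1.0), (0.1, 0.4768), (0.1, 0.4382), (0.1, 0.4051), (0.1, 0.3766),
    (0.1, 0.3517), (0.1, 0.3297), (0.1, 0.3103), (0.1, 0.2930), (0.1, 0.2775)
  )

  /** Weighted sum of the leaf values reached in each tree. */
  def margin(leaves: Seq[(Double, Double)]): Double =
    leaves.map { case (weight, value) => weight * value }.sum

  /** Assumption: a positive sum maps to class 1.0, anything else to class 0.0. */
  def predictLabel(m: Double): Double = if (m > 0.0) 1.0 else 0.0

  def main(args: Array[String]): Unit = {
    val m = margin(weightedLeaves)
    println(s"margin = $m, predicted label = ${predictLabel(m)}")
  }
}
```

Because Tree 0 has weight 1.0 and the later trees have weight 0.1, the first tree dominates the sum and the later trees refine it.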
    

ML Regression API

  • Function description

    Takes sample data in Dataset format, calls the training API, and outputs the GBDT regression model.

  • Input and output
    1. Package name: package org.apache.spark.ml.regression
    2. Class name: GBTRegressor
    3. Method name: fit
    4. Input: training sample data (Dataset[_]). The following are mandatory fields.

      | Parameter | Value Type | Default Value | Description |
      | --- | --- | --- | --- |
      | labelCol | Double | label | Label column to be predicted |
      | featuresCol | Vector | features | Feature vector column |

    5. Input: model parameters of the fit API (paramMap, paramMaps, firstParamPair, and otherParamPairs), described as follows:

      | Parameter | Value Type | Example | Description |
      | --- | --- | --- | --- |
      | paramMap | ParamMap | ParamMap(A.c -> b) | Assigns the value of b to the parameter c of model A. |
      | paramMaps | Array[ParamMap] | Array[ParamMap](n) | Creates an array of n ParamMap parameter lists for the model. |
      | firstParamPair | ParamPair | ParamPair(A.c, b) | Assigns the value of b to the parameter c of model A. |
      | otherParamPairs | ParamPair | ParamPair(A.e, f) | Assigns the value of f to the parameter e of model A. |

    6. Parameters optimized based on native algorithms
      def setCheckpointInterval(value: Int): GBTRegressor.this.type
      def setFeatureSubsetStrategy(value: String): GBTRegressor.this.type
      def setFeaturesCol(value: String): GBTRegressor
      def setImpurity(value: String): GBTRegressor.this.type
      def setLabelCol(value: String): GBTRegressor
      def setLossType(value: String): GBTRegressor.this.type
      def setMaxBins(value: Int): GBTRegressor.this.type
      def setMaxDepth(value: Int): GBTRegressor.this.type
      def setMaxIter(value: Int): GBTRegressor.this.type
      def setMinInfoGain(value: Double): GBTRegressor.this.type
      def setMinInstancesPerNode(value: Int): GBTRegressor.this.type
      def setPredictionCol(value: String): GBTRegressor
      def setSeed(value: Long): GBTRegressor.this.type
      def setStepSize(value: Double): GBTRegressor.this.type
      def setSubsamplingRate(value: Double): GBTRegressor.this.type

      An example is provided as follows:

      import org.apache.spark.ml.param.{ParamMap, ParamPair}
      import org.apache.spark.ml.regression.{GBTRegressionModel, GBTRegressor}

      // Illustrative values (assumptions). trainingData is assumed to be an existing
      // Dataset that contains the label and features columns.
      val maxDepth = 5
      val maxIter = 10
      val maxBins = 32

      val gbdt = new GBTRegressor() // Define the regression model.

      // Define the paramMap argument of fit(dataset: Dataset[_], paramMap: ParamMap).
      val paramMap = ParamMap(gbdt.maxDepth -> maxDepth)
        .put(gbdt.maxIter, maxIter)

      // Define the paramMaps argument of fit(dataset: Dataset[_], paramMaps: Array[ParamMap]).
      val paramMaps: Array[ParamMap] = new Array[ParamMap](2)
      for (i <- 0 until 2) { // Assign a value to each element; valid indices are 0 and 1.
        paramMaps(i) = ParamMap(gbdt.maxDepth -> maxDepth)
          .put(gbdt.maxIter, maxIter)
      }

      // Define the ParamPair arguments of
      // fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*).
      val maxDepthParamPair = ParamPair(gbdt.maxDepth, maxDepth)
      val maxIterParamPair = ParamPair(gbdt.maxIter, maxIter)
      val maxBinsParamPair = ParamPair(gbdt.maxBins, maxBins)

      // Call the fit APIs.
      val model1: GBTRegressionModel = gbdt.fit(trainingData)
      val model2: GBTRegressionModel = gbdt.fit(trainingData, paramMap)
      val models: Seq[GBTRegressionModel] = gbdt.fit(trainingData, paramMaps)
      val model3: GBTRegressionModel = gbdt.fit(trainingData, maxDepthParamPair, maxIterParamPair, maxBinsParamPair)
      
    7. Output: GBDT regression model (GBTRegressionModel or Seq[GBTRegressionModel]). The following table lists the field output in model prediction.

      | Parameter | Value Type | Default Value | Description |
      | --- | --- | --- | --- |
      | predictionCol | Double | prediction | Predicted label |

      Figure 2 shows the time sequence of the GBDT regression model.

      Figure 2 Time sequence of the GBDT regression model
  • Example
    fit(dataset: Dataset[_]): GBTRegressionModel example:
    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.evaluation.RegressionEvaluator
    import org.apache.spark.ml.feature.VectorIndexer
    import org.apache.spark.ml.regression.{GBTRegressionModel, GBTRegressor}

    // Load and parse the data file, converting it to a DataFrame.
    val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

    // Automatically identify categorical features, and index them.
    // Set maxCategories so features with > 4 distinct values are treated as continuous.
    val featureIndexer = new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexedFeatures")
      .setMaxCategories(4)
      .fit(data)

    // Split the data into training and test sets (30% held out for testing).
    val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

    // Train a GBT model.
    val gbt = new GBTRegressor()
      .setLabelCol("label")
      .setFeaturesCol("indexedFeatures")
      .setMaxIter(10)

    // Chain indexer and GBT in a Pipeline.
    val pipeline = new Pipeline()
      .setStages(Array(featureIndexer, gbt))

    // Train model. This also runs the indexer.
    val model = pipeline.fit(trainingData)

    // Make predictions.
    val predictions = model.transform(testData)

    // Select example rows to display.
    predictions.select("prediction", "label", "features").show(5)

    // Select (prediction, true label) and compute test error.
    val evaluator = new RegressionEvaluator()
      .setLabelCol("label")
      .setPredictionCol("prediction")
      .setMetricName("rmse")
    val rmse = evaluator.evaluate(predictions)
    println("Root Mean Squared Error (RMSE) on test data = " + rmse)

    // The GBT model is the second pipeline stage (index 1).
    val gbtModel = model.stages(1).asInstanceOf[GBTRegressionModel]
    println("Learned regression GBT model:\n" + gbtModel.toDebugString)
    
  • Result
    Root Mean Squared Error (RMSE) on test data = 0.0
    Learned regression GBT model:
    GBTRegressionModel (uid=gbtr_842c8acff963) with 10 trees
      Tree 0 (weight 1.0):
        If (feature 434 <= 70.5)
         If (feature 99 in {0.0,3.0})
          Predict: 0.0
         Else (feature 99 not in {0.0,3.0})
          Predict: 1.0
        Else (feature 434 > 70.5)
         Predict: 1.0
      Tree 1 (weight 0.1):
        Predict: 0.0
      Tree 2 (weight 0.1):
        Predict: 0.0
      Tree 3 (weight 0.1):
        Predict: 0.0
      Tree 4 (weight 0.1):
        Predict: 0.0
      Tree 5 (weight 0.1):
        Predict: 0.0
      Tree 6 (weight 0.1):
        Predict: 0.0
      Tree 7 (weight 0.1):
        Predict: 0.0
      Tree 8 (weight 0.1):
        Predict: 0.0
      Tree 9 (weight 0.1):
        Predict: 0.0
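
The RMSE reported above is the square root of the mean squared difference between predictions and labels. The following is a minimal sketch in plain Scala of that formula; RmseSketch and rmse are hypothetical names, the sample values are made up for illustration, and this is not the RegressionEvaluator implementation.

```scala
object RmseSketch {
  /** Root mean squared error between predictions and labels. */
  def rmse(predictions: Seq[Double], labels: Seq[Double]): Double = {
    require(predictions.nonEmpty && predictions.length == labels.length,
      "predictions and labels must be non-empty and equally long")
    val squaredErrors = predictions.zip(labels).map { case (p, y) => (p - y) * (p - y) }
    math.sqrt(squaredErrors.sum / squaredErrors.length)
  }

  def main(args: Array[String]): Unit = {
    // Perfect predictions, as in the result above, give an RMSE of zero.
    println(rmse(Seq(0.0, 1.0, 1.0), Seq(0.0, 1.0, 1.0)))
    // Imperfect predictions give a positive RMSE.
    println(rmse(Seq(0.5, 1.0, 0.0), Seq(0.0, 1.0, 1.0)))
  }
}
```

In the result above, Tree 1 to Tree 9 all predict 0.0, so the first tree alone determines every prediction, which is consistent with an RMSE of 0.0 on this sample data.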