Decision Tree

The Decision Tree algorithm provides two types of model APIs: ML Classification API and ML Regression API.

Model API Type: ML Classification API

Function API:

  • def fit(dataset: Dataset[_]): DecisionTreeClassificationModel
  • def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[DecisionTreeClassificationModel]
  • def fit(dataset: Dataset[_], paramMap: ParamMap): DecisionTreeClassificationModel
  • def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): DecisionTreeClassificationModel

Model API Type: ML Regression API

Function API:

  • def fit(dataset: Dataset[_]): DecisionTreeRegressionModel
  • def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[DecisionTreeRegressionModel]
  • def fit(dataset: Dataset[_], paramMap: ParamMap): DecisionTreeRegressionModel
  • def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): DecisionTreeRegressionModel

ML Classification API

  • Function description

    Pass training sample data in Dataset format to the training API (fit) to obtain a Decision Tree classification model.

  • Input and output
    1. Package name: package org.apache.spark.ml.classification
    2. Class name: DecisionTreeClassifier
    3. Method name: fit
    4. Input: training sample data (Dataset[_]). The following fields are mandatory.

      Parameter    Value Type  Default Value  Description
      labelCol     Double      label          Label to predict
      featuresCol  Vector      features       Feature vector

    5. Parameters optimized based on native algorithms
      def setCheckpointInterval(value: Int): DecisionTreeClassifier.this.type
      Specifies how often to checkpoint the cached node IDs.
      def setFeaturesCol(value: String): DecisionTreeClassifier
      def setImpurity(value: String): DecisionTreeClassifier.this.type
      def setLabelCol(value: String): DecisionTreeClassifier
      def setMaxBins(value: Int): DecisionTreeClassifier.this.type
      def setMaxDepth(value: Int): DecisionTreeClassifier.this.type
      def setMinInfoGain(value: Double): DecisionTreeClassifier.this.type
      def setMinInstancesPerNode(value: Int): DecisionTreeClassifier.this.type
      def setPredictionCol(value: String): DecisionTreeClassifier
      def setProbabilityCol(value: String): DecisionTreeClassifier
      def setRawPredictionCol(value: String): DecisionTreeClassifier
      def setSeed(value: Long): DecisionTreeClassifier.this.type
      def setThresholds(value: Array[Double]): DecisionTreeClassifier
    6. Newly added parameters

      numTrainingDataCopies
        Description: Number of training data copies
        Value type: Integer type. The value must be greater than or equal to 1. The default value is 1.

      broadcastVariables
        Description: Whether to broadcast variables with large storage space
        Value type: Boolean type. The default value is false.

      numPartsPerTrainingDataCopy
        Description: Number of partitions of a single training data copy
        Value type: Integer type. The value must be greater than or equal to 0. The default value is 0, indicating that re-partitioning is not performed.

      binnedFeaturesDataType
        Description: Storage format of features in training sample data
        Value type: String type. The value can be array (default) or fasthashmap.

      copyStrategy
        Description: Selection of the copy allocation policy
        Value type: String type. The value can be normal (default) or plus.

      numFeaturesOptFindSplits
        Description: Dimension threshold for enabling optimization on searching the high-dimensional feature split point
        Value type: Integer type. The default value is 8196.

      An example of calling the fit APIs is provided as follows:

      import org.apache.spark.ml.classification.DecisionTreeClassifier
      import org.apache.spark.ml.param.{ParamMap, ParamPair}

      // trainingData, maxDepth, maxDepth1, maxDepth2, and maxBins are placeholders
      // that the caller is expected to define beforehand.
      val dt = new DecisionTreeClassifier() // Definition

      // Define the parameter for def fit(dataset: Dataset[_], paramMap: ParamMap).
      val paramMap = ParamMap(dt.maxDepth -> maxDepth).put(dt.maxBins, maxBins)

      // Define the parameter for def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]).
      val paramMaps = new Array[ParamMap](2)
      for (i <- paramMaps.indices) { // Assign a value to each element of paramMaps.
        paramMaps(i) = ParamMap(dt.maxDepth -> maxDepth)
          .put(dt.maxBins, maxBins)
      }

      // Define the parameters for def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*).
      val firstParamPair = ParamPair(dt.maxDepth, maxDepth1)
      val otherParamPairs_1st = ParamPair(dt.maxDepth, maxDepth2)
      val otherParamPairs_2nd = ParamPair(dt.maxBins, maxBins)

      // Call the fit APIs. Each call gets its own name so the snippet also compiles outside the REPL.
      val model1 = dt.fit(trainingData)
      val model2 = dt.fit(trainingData, paramMap)
      val models = dt.fit(trainingData, paramMaps)
      val model3 = dt.fit(trainingData, firstParamPair, otherParamPairs_1st, otherParamPairs_2nd)
      
    7. Output: Decision Tree classification model (DecisionTreeClassificationModel). The following table lists the fields output in model prediction.

      Parameter         Value Type  Default Value  Description
      predictionCol     Double      prediction     Predicted label
      rawPredictionCol  Vector      rawPrediction  Vector whose length equals the number of classes, holding the counts of training instance labels at the tree node that makes the prediction
      probabilityCol    Vector      probability    Vector whose length equals the number of classes, equal to rawPrediction normalized to a multinomial distribution
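The relationship between the rawPrediction and probability output fields can be illustrated without Spark. The following is a minimal sketch, not the library's implementation: the object and method names are hypothetical, the handling of an all-zero count vector is an assumption, and the predicted class is taken as the index of the largest probability, which assumes no thresholds have been set.

```scala
object LeafProbabilityDemo {
  // Normalize per-class label counts (rawPrediction) so that they sum to 1.
  def normalize(rawPrediction: Array[Double]): Array[Double] = {
    val total = rawPrediction.sum
    // Assumption: an all-zero count vector is returned unchanged to avoid dividing by zero.
    if (total == 0.0) rawPrediction.clone() else rawPrediction.map(_ / total)
  }

  // Pick the class index with the largest probability (assumes no thresholds are set).
  def predict(probability: Array[Double]): Double =
    probability.indexOf(probability.max).toDouble

  def main(args: Array[String]): Unit = {
    // A leaf that saw 3 training instances of class 0 and 1 instance of class 1.
    val rawPrediction = Array(3.0, 1.0)
    val probability = normalize(rawPrediction)
    println(probability.mkString("[", ", ", "]")) // prints [0.75, 0.25]
    println(predict(probability))                 // prints 0.0
  }
}
```

In this sketch the leaf's counts (3, 1) become the probabilities (0.75, 0.25), and the prediction is class 0.0.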

  • Example
    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.DecisionTreeClassificationModel
    import org.apache.spark.ml.classification.DecisionTreeClassifier
    import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
    import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}
     
    // Load the data stored in LIBSVM format as a DataFrame.
    val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
     
    // Index labels, adding metadata to the label column.
    // Fit on whole dataset to include all labels in index.
    val labelIndexer = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("indexedLabel")
      .fit(data)
    // Automatically identify categorical features, and index them.
    val featureIndexer = new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexedFeatures")
      .setMaxCategories(4) // features with > 4 distinct values are treated as continuous.
      .fit(data)
     
    // Split the data into training and test sets (30% held out for testing).
    val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
     
    // Train a DecisionTree model.
    val dt = new DecisionTreeClassifier()
      .setLabelCol("indexedLabel")
      .setFeaturesCol("indexedFeatures")
     
    // Convert indexed labels back to original labels.
    val labelConverter = new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("predictedLabel")
      .setLabels(labelIndexer.labels)
     
    // Chain indexers and tree in a Pipeline.
    val pipeline = new Pipeline()
      .setStages(Array(labelIndexer, featureIndexer, dt, labelConverter))
     
    // Train model. This also runs the indexers.
    val model = pipeline.fit(trainingData)
     
    // Make predictions.
    val predictions = model.transform(testData)
     
    // Select example rows to display.
    predictions.select("predictedLabel", "label", "features").show(5)
     
    // Select (prediction, true label) and compute test error.
    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("indexedLabel")
      .setPredictionCol("prediction")
      .setMetricName("accuracy")
    val accuracy = evaluator.evaluate(predictions)
    println(s"Test Error = ${(1.0 - accuracy)}")
     
    val treeModel = model.stages(2).asInstanceOf[DecisionTreeClassificationModel]
    println(s"Learned classification tree model:\n ${treeModel.toDebugString}")
  • Result
    +--------------+-----+--------------------+
    |predictedLabel|label|            features|
    +--------------+-----+--------------------+
    |           1.0|  1.0|(47236,[270,439,5...|
    |           1.0|  1.0|(47236,[3023,6093...|
    |          -1.0| -1.0|(47236,[270,391,4...|
    |          -1.0| -1.0|(47236,[3718,3723...|
    |           1.0|  1.0|(47236,[729,760,1...|
    +--------------+-----+--------------------+
    only showing top 5 rows
     
    Test Error = 0.06476632743800015
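The newly added parameters listed under Input and output each document a default and, in most cases, a permitted range or value set. The sketch below encodes those documented constraints so that values can be checked before a job is submitted. The case class TreeTrainingOptions and its violations method are hypothetical helpers written for illustration; they are not part of the library, and this section does not describe how the values are actually passed to the algorithm.

```scala
// Hypothetical container for the newly added parameters; not a library class.
// Defaults mirror the documented default values.
final case class TreeTrainingOptions(
    numTrainingDataCopies: Int = 1,
    broadcastVariables: Boolean = false,
    numPartsPerTrainingDataCopy: Int = 0,
    binnedFeaturesDataType: String = "array",
    copyStrategy: String = "normal",
    numFeaturesOptFindSplits: Int = 8196) {

  // One message per violated documented constraint; an empty list means all values are valid.
  def violations: List[String] = List(
    if (numTrainingDataCopies < 1) Some("numTrainingDataCopies must be >= 1") else None,
    if (numPartsPerTrainingDataCopy < 0) Some("numPartsPerTrainingDataCopy must be >= 0") else None,
    if (!Set("array", "fasthashmap").contains(binnedFeaturesDataType))
      Some("binnedFeaturesDataType must be array or fasthashmap")
    else None,
    if (!Set("normal", "plus").contains(copyStrategy))
      Some("copyStrategy must be normal or plus")
    else None
  ).flatten
}

object TreeTrainingOptionsDemo {
  def main(args: Array[String]): Unit = {
    // The documented defaults satisfy every constraint.
    println(TreeTrainingOptions().violations.isEmpty) // prints true

    // Two invalid values produce two messages.
    val bad = TreeTrainingOptions(numTrainingDataCopies = 0, copyStrategy = "fast")
    bad.violations.foreach(println)
  }
}
```

No check is made for numFeaturesOptFindSplits or broadcastVariables because the table states no restriction beyond their types.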
    

ML Regression API

  • Function description

    Pass training sample data in Dataset format to the training API (fit) to obtain a Decision Tree regression model.

  • Input and output
    1. Package name: package org.apache.spark.ml.regression
    2. Class name: DecisionTreeRegressor
    3. Method name: fit
    4. Input: training sample data (Dataset[_]). The following fields are mandatory.

      Parameter    Value Type  Default Value  Description
      labelCol     Double      label          Label to predict
      featuresCol  Vector      features       Feature vector

    5. Parameters optimized based on native algorithms
      def setCheckpointInterval(value: Int): DecisionTreeRegressor.this.type
      Specifies how often to checkpoint the cached node IDs.
      def setFeaturesCol(value: String): DecisionTreeRegressor
      def setImpurity(value: String): DecisionTreeRegressor.this.type
      def setLabelCol(value: String): DecisionTreeRegressor
      def setMaxBins(value: Int): DecisionTreeRegressor.this.type
      def setMaxDepth(value: Int): DecisionTreeRegressor.this.type
      def setMinInfoGain(value: Double): DecisionTreeRegressor.this.type
      def setMinInstancesPerNode(value: Int): DecisionTreeRegressor.this.type
      def setPredictionCol(value: String): DecisionTreeRegressor
      def setSeed(value: Long): DecisionTreeRegressor.this.type
      def setVarianceCol(value: String): DecisionTreeRegressor.this.type
    6. Newly added parameters

      numTrainingDataCopies
        Description: Number of training data copies
        Value type: Integer type. The value must be greater than or equal to 1. The default value is 1.

      broadcastVariables
        Description: Whether to broadcast variables with large storage space
        Value type: Boolean type. The default value is false.

      numPartsPerTrainingDataCopy
        Description: Number of partitions of a single training data copy
        Value type: Integer type. The value must be greater than or equal to 0. The default value is 0, indicating that re-partitioning is not performed.

      binnedFeaturesDataType
        Description: Storage format of features in training sample data
        Value type: String type. The value can be array (default) or fasthashmap.

      copyStrategy
        Description: Selection of the copy allocation policy
        Value type: String type. The value can be normal (default) or plus.

      numFeaturesOptFindSplits
        Description: Dimension threshold for enabling optimization on searching the high-dimensional feature split point
        Value type: Integer type. The default value is 8196.

      An example of calling the fit APIs is provided as follows:

      import org.apache.spark.ml.param.{ParamMap, ParamPair}
      import org.apache.spark.ml.regression.DecisionTreeRegressor

      // trainingData, maxDepth, maxDepth1, maxDepth2, and maxBins are placeholders
      // that the caller is expected to define beforehand.
      val dt = new DecisionTreeRegressor() // Definition

      // Define the parameter for def fit(dataset: Dataset[_], paramMap: ParamMap).
      val paramMap = ParamMap(dt.maxDepth -> maxDepth).put(dt.maxBins, maxBins)

      // Define the parameter for def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]).
      val paramMaps = new Array[ParamMap](2)
      for (i <- paramMaps.indices) { // Assign a value to each element of paramMaps.
        paramMaps(i) = ParamMap(dt.maxDepth -> maxDepth)
          .put(dt.maxBins, maxBins)
      }

      // Define the parameters for def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*).
      val firstParamPair = ParamPair(dt.maxDepth, maxDepth1)
      val otherParamPairs_1st = ParamPair(dt.maxDepth, maxDepth2)
      val otherParamPairs_2nd = ParamPair(dt.maxBins, maxBins)

      // Call the fit APIs. Each call gets its own name so the snippet also compiles outside the REPL.
      val model1 = dt.fit(trainingData)
      val model2 = dt.fit(trainingData, paramMap)
      val models = dt.fit(trainingData, paramMaps)
      val model3 = dt.fit(trainingData, firstParamPair, otherParamPairs_1st, otherParamPairs_2nd)
      
    7. Output: Decision Tree regression model (DecisionTreeRegressionModel). The following table lists the fields output in model prediction.

      Parameter      Value Type  Default Value  Description
      predictionCol  Double      prediction     Predicted label
      varianceCol    Double      -              The biased sample variance of prediction
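The varianceCol description refers to a biased sample variance, that is, a variance that divides by n rather than n - 1. The sketch below shows the arithmetic on a small set of label values. It assumes that a regression leaf predicts the mean of the training labels that reached it; the object and method names are hypothetical and the numbers are made up for illustration.

```scala
object LeafVarianceDemo {
  // Mean of the labels that reached a leaf (assumed to be the leaf's prediction).
  def mean(labels: Seq[Double]): Double = {
    require(labels.nonEmpty, "labels must not be empty")
    labels.sum / labels.size
  }

  // Biased sample variance: the sum of squared deviations divided by n, not n - 1.
  def biasedVariance(labels: Seq[Double]): Double = {
    val m = mean(labels)
    labels.map(y => (y - m) * (y - m)).sum / labels.size
  }

  def main(args: Array[String]): Unit = {
    val leafLabels = Seq(1.0, 2.0, 3.0, 6.0)
    println(mean(leafLabels))           // prints 3.0
    println(biasedVariance(leafLabels)) // prints 3.5
  }
}
```

For these four labels the squared deviations sum to 14, so the biased variance is 14 / 4 = 3.5, whereas the unbiased estimate would be 14 / 3.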

  • Example
    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.evaluation.RegressionEvaluator
    import org.apache.spark.ml.feature.VectorIndexer
    import org.apache.spark.ml.regression.DecisionTreeRegressionModel
    import org.apache.spark.ml.regression.DecisionTreeRegressor
     
    // Load the data stored in LIBSVM format as a DataFrame.
    val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
     
    // Automatically identify categorical features, and index them.
    // Here, we treat features with > 4 distinct values as continuous.
    val featureIndexer = new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexedFeatures")
      .setMaxCategories(4)
      .fit(data)
     
    // Split the data into training and test sets (30% held out for testing).
    val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
     
    // Train a DecisionTree model.
    val dt = new DecisionTreeRegressor()
      .setLabelCol("label")
      .setFeaturesCol("indexedFeatures")
     
    // Chain indexer and tree in a Pipeline.
    val pipeline = new Pipeline()
      .setStages(Array(featureIndexer, dt))
     
    // Train model. This also runs the indexer.
    val model = pipeline.fit(trainingData)
     
    // Make predictions.
    val predictions = model.transform(testData)
     
    // Select example rows to display.
    predictions.select("prediction", "label", "features").show(5)
     
    // Select (prediction, true label) and compute test error.
    val evaluator = new RegressionEvaluator()
      .setLabelCol("label")
      .setPredictionCol("prediction")
      .setMetricName("rmse")
    val rmse = evaluator.evaluate(predictions)
    println(s"Root Mean Squared Error (RMSE) on test data = $rmse")
     
    val treeModel = model.stages(1).asInstanceOf[DecisionTreeRegressionModel]
    println(s"Learned regression tree model:\n ${treeModel.toDebugString}")
  • Result
    +----------+-----+--------------------+
    |prediction|label|            features|
    +----------+-----+--------------------+
    |      0.51|  0.3|(1000,[0,1,2,3,4,...|
    +----------+-----+--------------------+
     
    Root Mean Squared Error (RMSE) on test data = 0.21000000000000002
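The rmse metric reported by the evaluator is the square root of the mean squared difference between prediction and label. The following minimal sketch reproduces that formula in plain Scala on made-up (prediction, label) pairs; the object and method names are hypothetical and it is not the evaluator's implementation.

```scala
object RmseDemo {
  // Root mean squared error over (prediction, label) pairs.
  def rmse(pairs: Seq[(Double, Double)]): Double = {
    require(pairs.nonEmpty, "pairs must not be empty")
    val squaredErrors = pairs.map { case (prediction, label) =>
      val error = prediction - label
      error * error
    }
    math.sqrt(squaredErrors.sum / pairs.size)
  }

  def main(args: Array[String]): Unit = {
    // Errors are 1.0 and -7.0, so the mean squared error is (1 + 49) / 2 = 25.
    val pairs = Seq((2.0, 1.0), (3.0, 10.0))
    println(rmse(pairs)) // prints 5.0
  }
}
```

When only one row is evaluated, the RMSE reduces to the absolute difference between that row's prediction and label.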