DecisionTree

The DecisionTree algorithm provides two types of model APIs: ML classification and ML regression.

Model API Type	Function API
ML Classification API	def fit(dataset: Dataset[_]): DecisionTreeClassificationModel
	def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[DecisionTreeClassificationModel]
	def fit(dataset: Dataset[_], paramMap: ParamMap): DecisionTreeClassificationModel
	def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): DecisionTreeClassificationModel
ML Regression API	def fit(dataset: Dataset[_]): DecisionTreeRegressionModel
	def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[DecisionTreeRegressionModel]
	def fit(dataset: Dataset[_], paramMap: ParamMap): DecisionTreeRegressionModel
	def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): DecisionTreeRegressionModel

ML Classification API

Function
Import sample data in dataset format, call the training API, and output the decision tree classification model.

Input and output

Package name: package org.apache.spark.ml.classification
Class name: DecisionTreeClassifier
Method name: fit
Input: training sample data (Dataset[_]). Mandatory fields are as follows:
Parameter	Type	Default Value	Description
labelCol	Double	label	Label to predict
featuresCol	Vector	features	Feature vector

Algorithm parameters

Algorithm Parameter

def setCheckpointInterval(value: Int): DecisionTreeClassifier.this.type

Specifies how often to checkpoint the cached node IDs.

def setFeaturesCol(value: String): DecisionTreeClassifier

def setImpurity(value: String): DecisionTreeClassifier.this.type

def setLabelCol(value: String): DecisionTreeClassifier

def setMaxBins(value: Int):DecisionTreeClassifier.this.type

def setMaxDepth(value: Int): DecisionTreeClassifier.this.type

def setMinInfoGain(value: Double): DecisionTreeClassifier.this.type

def setMinInstancesPerNode(value: Int):DecisionTreeClassifier.this.type

def setPredictionCol(value: String): DecisionTreeClassifier

def setProbabilityCol(value: String): DecisionTreeClassifier

def setRawPredictionCol(value: String): DecisionTreeClassifier

def setSeed(value: Long): DecisionTreeClassifier.this.type

def setThresholds(value: Array[Double]): DecisionTreeClassifier
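
The setters above can be chained when the estimator is constructed. The following is a minimal sketch, not a recommended configuration: every value is an illustrative assumption, and the column names simply repeat the defaults from the input table.

```scala
import org.apache.spark.ml.classification.DecisionTreeClassifier

// All values below are illustrative assumptions; tune them for real data.
val dtSketch = new DecisionTreeClassifier()
  .setLabelCol("label")          // column holding the label to predict
  .setFeaturesCol("features")    // column holding the feature vector
  .setImpurity("gini")           // classification supports "gini" and "entropy"
  .setMaxDepth(5)                // maximum depth of the tree
  .setMaxBins(32)                // maximum number of bins for discretizing continuous features
  .setMinInstancesPerNode(1)     // minimum number of instances each child must have after a split
  .setMinInfoGain(0.0)           // minimum information gain required for a split
  .setCheckpointInterval(10)     // how often to checkpoint the cached node IDs
  .setSeed(42L)                  // random seed

// explainParams() lists every parameter together with its current value.
println(dtSketch.explainParams())
```

Each setter returns the estimator itself, so the order of the calls does not matter.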

Added algorithm parameters

Parameter	Description	Type
numTrainingDataCopies	Number of training data copies	Integer type. The value must be greater than or equal to 1 (default).
broadcastVariables	Whether to broadcast variables that have large storage space	Boolean type. The default value is false.
numPartsPerTrainingDataCopy	Number of partitions of a single training data copy	Integer type. The value must be greater than or equal to 0 (default, indicating that re-partitioning is not performed).
binnedFeaturesDataType	Storage format of features in training sample data	String type. The value can be array (default) or fasthashmap.
copyStrategy	Selection of the copy allocation policy	String type. The value can be normal (default) or plus.
numFeaturesOptFindSplits	Dimension threshold for enabling optimization on searching the high-dimensional feature split point	Integer type. The default value is 8196.

An example is provided as follows:

import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.param.{ParamMap, ParamPair}

// maxDepth, maxDepth1, maxDepth2, and maxBins (Int) and trainingData (Dataset[_])
// are assumed to be defined before this snippet runs.
val dt = new DecisionTreeClassifier() // Definition

// Define the def fit(dataset: Dataset[_], paramMap: ParamMap) API parameter.
val paramMap = ParamMap(dt.maxDepth -> maxDepth).put(dt.maxBins, maxBins)

// Define the def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]) API parameter.
val paramMaps = new Array[ParamMap](2)
// Use "until" so that the index stops at paramMaps.size - 1.
for (i <- 0 until paramMaps.size) {
  paramMaps(i) = ParamMap(dt.maxDepth -> maxDepth)
    .put(dt.maxBins, maxBins)
} // Assign a value to paramMaps.

// Define the def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*) API parameter.
val firstParamPair = ParamPair(dt.maxDepth, maxDepth1)
val otherParamPairs_1st = ParamPair(dt.maxDepth, maxDepth2)
val otherParamPairs_2nd = ParamPair(dt.maxBins, maxBins)

// Call the fit APIs. Each overload returns a new model (or one model per ParamMap).
val modelDefault = dt.fit(trainingData)
val modelFromMap = dt.fit(trainingData, paramMap)
val models = dt.fit(trainingData, paramMaps)
val modelFromPairs = dt.fit(trainingData, firstParamPair, otherParamPairs_1st, otherParamPairs_2nd)

Output: decision tree classification model (DecisionTreeClassificationModel). The following fields are output from model prediction.

Parameter	Type	Default Value	Description
predictionCol	Double	prediction	Predicted label
rawPredictionCol	Vector	rawPrediction	Vector whose length equals the number of classes, holding the counts of training instance labels at the tree node that makes the prediction
probabilityCol	Vector	probability	Vector whose length equals the number of classes, equal to rawPrediction normalized to a multinomial distribution
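
The relationship between the three output fields can be illustrated with plain Scala. This is a sketch of the arithmetic only, not the library implementation: the leaf counts are hypothetical, and the prediction rule shown applies when no thresholds are set.

```scala
// Hypothetical label counts at the leaf that makes the prediction
// (the rawPrediction vector for a three-class problem).
val rawPrediction = Array(2.0, 6.0, 2.0)

// probability: the counts normalized so that they sum to 1.
val total = rawPrediction.sum
val probability = rawPrediction.map(_ / total)

// prediction: index of the most probable class, returned as a Double.
val predictedLabel = probability.indexOf(probability.max).toDouble

println(probability.mkString("[", ", ", "]"))
println(predictedLabel)
```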

Sample usage

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassificationModel
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}
 
// Load the data stored in LIBSVM format as a DataFrame.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
 
// Index labels, adding metadata to the label column.
// Fit on whole dataset to include all labels in index.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(data)
// Automatically identify categorical features, and index them.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4) // features with > 4 distinct values are treated as continuous.
  .fit(data)
 
// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
 
// Train a DecisionTree model.
val dt = new DecisionTreeClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
 
// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)
 
// Chain indexers and tree in a Pipeline.
val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, featureIndexer, dt, labelConverter))
 
// Train model. This also runs the indexers.
val model = pipeline.fit(trainingData)
 
// Make predictions.
val predictions = model.transform(testData)
 
// Select example rows to display.
predictions.select("predictedLabel", "label", "features").show(5)
 
// Select (prediction, true label) and compute test error.
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("indexedLabel")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println(s"Test Error = ${(1.0 - accuracy)}")
 
val treeModel = model.stages(2).asInstanceOf[DecisionTreeClassificationModel]
println(s"Learned classification tree model:\n ${treeModel.toDebugString}")

Sample result

+--------------+-----+--------------------+
|predictedLabel|label|            features|
+--------------+-----+--------------------+
|           1.0|  1.0|(47236,[270,439,5...|
|           1.0|  1.0|(47236,[3023,6093...|
|          -1.0| -1.0|(47236,[270,391,4...|
|          -1.0| -1.0|(47236,[3718,3723...|
|           1.0|  1.0|(47236,[729,760,1...|
+--------------+-----+--------------------+
only showing top 5 rows
 
Test Error = 0.06476632743800015

ML Regression API

Function
Import sample data in dataset format, call the training API, and output the decision tree regression model.

Input and output

Package name: package org.apache.spark.ml.regression
Class name: DecisionTreeRegressor
Method name: fit
Input: training sample data (Dataset[_]). Mandatory fields are as follows:
Parameter	Type	Default Value	Description
labelCol	Double	label	Label to predict
featuresCol	Vector	features	Feature vector

Algorithm parameters

Algorithm Parameter

def setCheckpointInterval(value: Int): DecisionTreeRegressor.this.type

Specifies how often to checkpoint the cached node IDs.

def setFeaturesCol(value: String): DecisionTreeRegressor

def setImpurity(value: String): DecisionTreeRegressor.this.type

def setLabelCol(value: String): DecisionTreeRegressor

def setMaxBins(value: Int): DecisionTreeRegressor.this.type

def setMaxDepth(value: Int): DecisionTreeRegressor.this.type

def setMinInfoGain(value: Double): DecisionTreeRegressor.this.type

def setMinInstancesPerNode(value: Int): DecisionTreeRegressor.this.type

def setPredictionCol(value: String): DecisionTreeRegressor

def setSeed(value: Long): DecisionTreeRegressor.this.type

def setVarianceCol(value: String): DecisionTreeRegressor.this.type
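
As with the classifier, the regressor setters can be chained. The sketch below uses illustrative values only; the name given to the variance column is an assumption, because that column has no default name.

```scala
import org.apache.spark.ml.regression.DecisionTreeRegressor

// All values below are illustrative assumptions; tune them for real data.
val dtrSketch = new DecisionTreeRegressor()
  .setLabelCol("label")          // column holding the label to predict
  .setFeaturesCol("features")    // column holding the feature vector
  .setImpurity("variance")       // "variance" is the impurity supported for regression
  .setMaxDepth(5)                // maximum depth of the tree
  .setMaxBins(32)                // maximum number of bins for discretizing continuous features
  .setMinInstancesPerNode(1)     // minimum number of instances each child must have after a split
  .setMinInfoGain(0.0)           // minimum information gain required for a split
  .setVarianceCol("variance")    // hypothetical name for the output variance column
  .setSeed(42L)                  // random seed

// explainParams() lists every parameter together with its current value.
println(dtrSketch.explainParams())
```

The variance column is written only when setVarianceCol is called with a non-empty name.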

Added algorithm parameters

Parameter	Description	Type
numTrainingDataCopies	Number of training data copies	Integer type. The value must be greater than or equal to 1 (default).
broadcastVariables	Whether to broadcast variables that have large storage space	Boolean type. The default value is false.
numPartsPerTrainingDataCopy	Number of partitions of a single training data copy	Integer type. The value must be greater than or equal to 0 (default, indicating that re-partitioning is not performed).
binnedFeaturesDataType	Storage format of features in training sample data	String type. The value can be array (default) or fasthashmap.
copyStrategy	Selection of the copy allocation policy	String type. The value can be normal (default) or plus.
numFeaturesOptFindSplits	Dimension threshold for enabling optimization on searching the high-dimensional feature split point	Integer type. The default value is 8196.

An example is provided as follows:

import org.apache.spark.ml.param.{ParamMap, ParamPair}
import org.apache.spark.ml.regression.DecisionTreeRegressor

// maxDepth, maxDepth1, maxDepth2, and maxBins (Int) and trainingData (Dataset[_])
// are assumed to be defined before this snippet runs.
val dt = new DecisionTreeRegressor() // Definition

// Define the def fit(dataset: Dataset[_], paramMap: ParamMap) API parameter.
val paramMap = ParamMap(dt.maxDepth -> maxDepth).put(dt.maxBins, maxBins)

// Define the def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]) API parameter.
val paramMaps = new Array[ParamMap](2)
// Use "until" so that the index stops at paramMaps.size - 1.
for (i <- 0 until paramMaps.size) {
  paramMaps(i) = ParamMap(dt.maxDepth -> maxDepth)
    .put(dt.maxBins, maxBins)
} // Assign a value to paramMaps.

// Define the def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*) API parameter.
val firstParamPair = ParamPair(dt.maxDepth, maxDepth1)
val otherParamPairs_1st = ParamPair(dt.maxDepth, maxDepth2)
val otherParamPairs_2nd = ParamPair(dt.maxBins, maxBins)

// Call the fit APIs. Each overload returns a new model (or one model per ParamMap).
val modelDefault = dt.fit(trainingData)
val modelFromMap = dt.fit(trainingData, paramMap)
val models = dt.fit(trainingData, paramMaps)
val modelFromPairs = dt.fit(trainingData, firstParamPair, otherParamPairs_1st, otherParamPairs_2nd)

Output: decision tree regression model (DecisionTreeRegressionModel). The following fields are output from model prediction.
Parameter	Type	Default Value	Description
predictionCol	Double	prediction	Predicted label
varianceCol	Double	-	The biased sample variance of prediction
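
To make the two output fields concrete, the plain-Scala sketch below computes both values for one leaf. It illustrates the arithmetic under the assumption that a leaf predicts the mean of the training labels it contains; the labels are hypothetical and this is not the library implementation.

```scala
// Hypothetical training labels that fall into a single leaf.
val leafLabels = Array(1.0, 2.0, 3.0, 6.0)
val n = leafLabels.length

// prediction: the mean of the labels in the leaf.
val leafPrediction = leafLabels.sum / n

// variance: the biased sample variance, which divides by n rather than n - 1.
val leafVariance = leafLabels.map(y => math.pow(y - leafPrediction, 2)).sum / n

println(s"prediction = $leafPrediction, variance = $leafVariance")
```

Dividing by n - 1 instead would give the unbiased estimate, which is larger for the same leaf.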

Sample usage

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.regression.DecisionTreeRegressionModel
import org.apache.spark.ml.regression.DecisionTreeRegressor
 
// Load the data stored in LIBSVM format as a DataFrame.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
 
// Automatically identify categorical features, and index them.
// Here, we treat features with > 4 distinct values as continuous.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(data)
 
// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
 
// Train a DecisionTree model.
val dt = new DecisionTreeRegressor()
  .setLabelCol("label")
  .setFeaturesCol("indexedFeatures")
 
// Chain indexer and tree in a Pipeline.
val pipeline = new Pipeline()
  .setStages(Array(featureIndexer, dt))
 
// Train model. This also runs the indexer.
val model = pipeline.fit(trainingData)
 
// Make predictions.
val predictions = model.transform(testData)
 
// Select example rows to display.
predictions.select("prediction", "label", "features").show(5)
 
// Select (prediction, true label) and compute test error.
val evaluator = new RegressionEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("rmse")
val rmse = evaluator.evaluate(predictions)
println(s"Root Mean Squared Error (RMSE) on test data = $rmse")
 
val treeModel = model.stages(1).asInstanceOf[DecisionTreeRegressionModel]
println(s"Learned regression tree model:\n ${treeModel.toDebugString}")

Sample result

+----------+-----+--------------------+
|prediction|label|            features|
+----------+-----+--------------------+
|      0.51|  0.3|(1000,[0,1,2,3,4,...|
+----------+-----+--------------------+
 
Root Mean Squared Error (RMSE) on test data = 0.21000000000000002
