Decision Tree

The Decision Tree algorithm provides two types of model APIs: ML Classification API and ML Regression API.

Model API Type	Function API
ML Classification API	def fit(dataset: Dataset[_]): DecisionTreeClassificationModel
	def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[DecisionTreeClassificationModel]
	def fit(dataset: Dataset[_], paramMap: ParamMap): DecisionTreeClassificationModel
	def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): DecisionTreeClassificationModel
ML Regression API	def fit(dataset: Dataset[_]): DecisionTreeRegressionModel
	def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[DecisionTreeRegressionModel]
	def fit(dataset: Dataset[_], paramMap: ParamMap): DecisionTreeRegressionModel
	def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): DecisionTreeRegressionModel

ML Classification API

Function description
Takes sample data in Dataset format, trains on it through the fit API, and outputs a Decision Tree classification model.

Input and output

Package name: package org.apache.spark.ml.classification
Class name: DecisionTreeClassifier
Method name: fit
Input: training sample data (Dataset[_]). The following are mandatory fields.
Parameter	Value Type	Default Value	Description
labelCol	Double	label	Label to predict
featuresCol	Vector	features	Feature vector

Parameters optimized based on native algorithms

def setCheckpointInterval(value: Int): DecisionTreeClassifier.this.type
Specifies how often to checkpoint the cached node IDs.
def setFeaturesCol(value: String): DecisionTreeClassifier
def setImpurity(value: String): DecisionTreeClassifier.this.type
def setLabelCol(value: String): DecisionTreeClassifier
def setMaxBins(value: Int): DecisionTreeClassifier.this.type
def setMaxDepth(value: Int): DecisionTreeClassifier.this.type
def setMinInfoGain(value: Double): DecisionTreeClassifier.this.type
def setMinInstancesPerNode(value: Int): DecisionTreeClassifier.this.type
def setPredictionCol(value: String): DecisionTreeClassifier
def setProbabilityCol(value: String): DecisionTreeClassifier
def setRawPredictionCol(value: String): DecisionTreeClassifier
def setSeed(value: Long): DecisionTreeClassifier.this.type
def setThresholds(value: Array[Double]): DecisionTreeClassifier
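
Each setter above returns the estimator, so the calls can be chained. The following is a minimal sketch, assuming Spark ML is on the classpath; the variable name tunedDt and all parameter values are illustrative placeholders rather than recommended settings.

```scala
import org.apache.spark.ml.classification.DecisionTreeClassifier

// Illustrative values only; tune them for your data.
val tunedDt = new DecisionTreeClassifier()
  .setLabelCol("label")          // default label column name
  .setFeaturesCol("features")    // default features column name
  .setImpurity("gini")           // "gini" or "entropy"
  .setMaxDepth(5)
  .setMaxBins(32)
  .setMinInstancesPerNode(1)
  .setMinInfoGain(0.0)
  .setSeed(42L)

// Print the current value of every parameter, including untouched defaults.
println(tunedDt.explainParams())
```

Parameters set this way apply to every later fit call on tunedDt unless a ParamMap or ParamPair passed to fit overrides them.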

Newly added parameters

Parameter	Description	Value Type
numTrainingDataCopies	Number of training data copies	Integer type. The value must be greater than or equal to 1. The default value is 1.
broadcastVariables	Whether to broadcast variables with large storage space	Boolean type. The default value is false.
numPartsPerTrainingDataCopy	Number of partitions of a single training data copy	Integer type. The value must be greater than or equal to 0. The default value is 0, indicating that re-partitioning is not performed.
binnedFeaturesDataType	Storage format of features in training sample data	String type. The value can be array (default) or fasthashmap.
copyStrategy	Selection of the copy allocation policy	String type. The value can be normal (default) or plus.
numFeaturesOptFindSplits	Dimension threshold for enabling optimization on searching the high-dimensional feature split point	Integer type. The default value is 8196.

An example is provided as follows:

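import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.param.{ParamMap, ParamPair}

// Example hyperparameter values (placeholders; tune them for your data).
val maxDepth = 5
val maxBins = 32
val maxDepth1 = 3
val maxDepth2 = 7

val dt = new DecisionTreeClassifier() // Define the estimator.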

// Define the def fit(dataset: Dataset[_], paramMap: ParamMap) API parameter.
val paramMap = ParamMap(dt.maxDepth -> maxDepth).put(dt.maxBins, maxBins)

// Define the def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]) API parameter.
val paramMaps = new Array[ParamMap](2)
// Iterate over the valid indices only; "0 to size" would run one past the end.
for (i <- paramMaps.indices) {
  paramMaps(i) = ParamMap(dt.maxDepth -> maxDepth)
    .put(dt.maxBins, maxBins)
} // Assign a value to each element of paramMaps.

// Define the def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*) API parameter.
val firstParamPair = ParamPair(dt.maxDepth, maxDepth1)
val otherParamPairs_1st = ParamPair(dt.maxDepth, maxDepth2)
val otherParamPairs_2nd = ParamPair(dt.maxBins, maxBins)

// Call the fit APIs. trainingData is the training Dataset prepared beforehand.
val model = dt.fit(trainingData)
val modelWithMap = dt.fit(trainingData, paramMap)
val models = dt.fit(trainingData, paramMaps)
val modelWithPairs = dt.fit(trainingData, firstParamPair, otherParamPairs_1st, otherParamPairs_2nd)

Output: Decision Tree classification model (DecisionTreeClassificationModel). The following table lists the fields output in model prediction.

Parameter	Value Type	Default Value	Description
predictionCol	Double	prediction	Predicted label
rawPredictionCol	Vector	rawPrediction	Vector whose length equals the number of classes, holding the counts of training instance labels at the tree node that makes the prediction
probabilityCol	Vector	probability	Vector whose length equals the number of classes, equal to rawPrediction normalized to a multinomial distribution
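
To make the relationship between rawPredictionCol and probabilityCol in the table above concrete, the sketch below normalizes a vector of per-class counts in plain Scala. It does not call Spark; the helper name normalizeCounts and the sample counts are hypothetical and only illustrate the arithmetic.

```scala
// Hypothetical helper: turns the per-class label counts at a leaf node
// (the rawPrediction vector) into class probabilities.
def normalizeCounts(rawPrediction: Array[Double]): Array[Double] = {
  val total = rawPrediction.sum
  // With no instances at the node, return the counts unchanged to avoid dividing by zero.
  if (total == 0.0) rawPrediction.clone()
  else rawPrediction.map(_ / total)
}

// Example: a leaf that saw 3 training instances of class 0 and 1 of class 1.
val probability = normalizeCounts(Array(3.0, 1.0))
println(probability.mkString("[", ", ", "]")) // prints "[0.75, 0.25]"
```

The predicted label is then the index of the largest probability, unless thresholds set through setThresholds change the choice.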

Example

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassificationModel
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}
 
// Load the data stored in LIBSVM format as a DataFrame.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
 
// Index labels, adding metadata to the label column.
// Fit on whole dataset to include all labels in index.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(data)
// Automatically identify categorical features, and index them.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4) // features with > 4 distinct values are treated as continuous.
  .fit(data)
 
// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
 
// Train a DecisionTree model.
val dt = new DecisionTreeClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
 
// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)
 
// Chain indexers and tree in a Pipeline.
val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, featureIndexer, dt, labelConverter))
 
// Train model. This also runs the indexers.
val model = pipeline.fit(trainingData)
 
// Make predictions.
val predictions = model.transform(testData)
 
// Select example rows to display.
predictions.select("predictedLabel", "label", "features").show(5)
 
// Select (prediction, true label) and compute test error.
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("indexedLabel")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println(s"Test Error = ${(1.0 - accuracy)}")
 
val treeModel = model.stages(2).asInstanceOf[DecisionTreeClassificationModel]
println(s"Learned classification tree model:\n ${treeModel.toDebugString}")

Result

+--------------+-----+--------------------+
|predictedLabel|label|            features|
+--------------+-----+--------------------+
|           1.0|  1.0|(47236,[270,439,5...|
|           1.0|  1.0|(47236,[3023,6093...|
|          -1.0| -1.0|(47236,[270,391,4...|
|          -1.0| -1.0|(47236,[3718,3723...|
|           1.0|  1.0|(47236,[729,760,1...|
+--------------+-----+--------------------+
only showing top 5 rows
 
Test Error = 0.06476632743800015

ML Regression API

Function description
Takes sample data in Dataset format, trains on it through the fit API, and outputs a Decision Tree regression model.