Decision Tree
The Decision Tree algorithm provides two types of model APIs: ML Classification API and ML Regression API.
| Model API Type | Function API |
|---|---|
| ML Classification API | def fit(dataset: Dataset[_]): DecisionTreeClassificationModel |
| | def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[DecisionTreeClassificationModel] |
| | def fit(dataset: Dataset[_], paramMap: ParamMap): DecisionTreeClassificationModel |
| | def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): DecisionTreeClassificationModel |
| ML Regression API | def fit(dataset: Dataset[_]): DecisionTreeRegressionModel |
| | def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[DecisionTreeRegressionModel] |
| | def fit(dataset: Dataset[_], paramMap: ParamMap): DecisionTreeRegressionModel |
| | def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): DecisionTreeRegressionModel |
ML Classification API
- Input and output
- Package name: package org.apache.spark.ml.classification
- Class name: DecisionTreeClassifier
- Method name: fit
- Input: training sample data (Dataset[_]). The following are mandatory fields.
| Parameter | Value Type | Default Value | Description |
|---|---|---|---|
| labelCol | Double | label | Label to predict |
| featuresCol | Vector | features | Feature vector |
- Parameters optimized based on native algorithms
  - def setCheckpointInterval(value: Int): DecisionTreeClassifier.this.type (specifies how often to checkpoint the cached node IDs)
  - def setFeaturesCol(value: String): DecisionTreeClassifier
  - def setImpurity(value: String): DecisionTreeClassifier.this.type
  - def setLabelCol(value: String): DecisionTreeClassifier
  - def setMaxBins(value: Int): DecisionTreeClassifier.this.type
  - def setMaxDepth(value: Int): DecisionTreeClassifier.this.type
  - def setMinInfoGain(value: Double): DecisionTreeClassifier.this.type
  - def setMinInstancesPerNode(value: Int): DecisionTreeClassifier.this.type
  - def setPredictionCol(value: String): DecisionTreeClassifier
  - def setProbabilityCol(value: String): DecisionTreeClassifier
  - def setRawPredictionCol(value: String): DecisionTreeClassifier
  - def setSeed(value: Long): DecisionTreeClassifier.this.type
  - def setThresholds(value: Array[Double]): DecisionTreeClassifier
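Each setter listed above returns the estimator, so calls can be chained. The following is a minimal sketch, assuming the `DecisionTreeClassifier` class from the package named above is on the classpath. The object name `SetterSketch` and every value shown (depth 5, 32 bins, seed 42, and so on) are illustrative assumptions, not recommended settings.

```scala
import org.apache.spark.ml.classification.DecisionTreeClassifier

object SetterSketch {
  // Illustrative values only; tune them for your own data.
  val dt: DecisionTreeClassifier = new DecisionTreeClassifier()
    .setLabelCol("label")          // column holding the label to predict
    .setFeaturesCol("features")    // column holding the feature vector
    .setImpurity("gini")           // Spark ML accepts "gini" or "entropy" here
    .setMaxDepth(5)
    .setMaxBins(32)
    .setMinInstancesPerNode(1)
    .setMinInfoGain(0.0)
    .setCheckpointInterval(10)
    .setSeed(42L)
}
```

The configured values can be read back with the matching getters, for example `SetterSketch.dt.getMaxDepth`.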
- Newly added parameters
| Parameter | Description | Value Type |
|---|---|---|
| numTrainingDataCopies | Number of training data copies | Integer type. The value must be greater than or equal to 1. The default value is 1. |
| broadcastVariables | Whether to broadcast variables with large storage space | Boolean type. The default value is false. |
| numPartsPerTrainingDataCopy | Number of partitions of a single training data copy | Integer type. The value must be greater than or equal to 0. The default value is 0, indicating that re-partitioning is not performed. |
| binnedFeaturesDataType | Storage format of features in training sample data | String type. The value can be array (default) or fasthashmap. |
| copyStrategy | Selection of the copy allocation policy | String type. The value can be normal (default) or plus. |
| numFeaturesOptFindSplits | Dimension threshold for enabling optimization on searching the high-dimensional feature split point | Integer type. The default value is 8196. |
An example is provided as follows:
    import org.apache.spark.ml.classification.DecisionTreeClassifier
    import org.apache.spark.ml.param.{ParamMap, ParamPair}

    // maxDepth, maxDepth1, maxDepth2, maxBins, and trainingData are placeholders supplied by the caller.
    val dt = new DecisionTreeClassifier() // Definition

    // Define the def fit(dataset: Dataset[_], paramMap: ParamMap) API parameter.
    val paramMap = ParamMap(dt.maxDepth -> maxDepth).put(dt.maxBins, maxBins)

    // Define the def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]) API parameter.
    val paramMaps = new Array[ParamMap](2)
    for (i <- 0 until paramMaps.length) { // Assign a value to each element of paramMaps.
      paramMaps(i) = ParamMap(dt.maxDepth -> maxDepth)
        .put(dt.maxBins, maxBins)
    }

    // Define the def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*) API parameter.
    val firstParamPair = ParamPair(dt.maxDepth, maxDepth1)
    val otherParamPairs_1st = ParamPair(dt.maxDepth, maxDepth2)
    val otherParamPairs_2nd = ParamPair(dt.maxBins, maxBins)

    // Call the fit APIs.
    val model1 = dt.fit(trainingData)
    val model2 = dt.fit(trainingData, paramMap)
    val models = dt.fit(trainingData, paramMaps)
    val model3 = dt.fit(trainingData, firstParamPair, otherParamPairs_1st, otherParamPairs_2nd)
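The value constraints listed for the newly added parameters can be written down as explicit checks. The sketch below is only an illustration of those constraints: the case class `TreeOptParams` is hypothetical and is not part of the library's API, and this document does not state how the library itself reads these parameters.

```scala
// Hypothetical container for the newly added parameters. The class name and
// layout are assumptions for illustration; only the defaults and allowed
// values come from the parameter table.
final case class TreeOptParams(
    numTrainingDataCopies: Int = 1,
    broadcastVariables: Boolean = false,
    numPartsPerTrainingDataCopy: Int = 0, // 0 means no re-partitioning
    binnedFeaturesDataType: String = "array",
    copyStrategy: String = "normal",
    numFeaturesOptFindSplits: Int = 8196
) {
  require(numTrainingDataCopies >= 1,
    "numTrainingDataCopies must be greater than or equal to 1")
  require(numPartsPerTrainingDataCopy >= 0,
    "numPartsPerTrainingDataCopy must be greater than or equal to 0")
  require(Set("array", "fasthashmap").contains(binnedFeaturesDataType),
    "binnedFeaturesDataType must be array or fasthashmap")
  require(Set("normal", "plus").contains(copyStrategy),
    "copyStrategy must be normal or plus")
}
```

Constructing `TreeOptParams()` yields the documented defaults, while a call such as `TreeOptParams(numTrainingDataCopies = 0)` fails with an `IllegalArgumentException`.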
- Output: Decision Tree classification model (DecisionTreeClassificationModel). The following table lists the fields output in model prediction.
| Parameter | Value Type | Default Value | Description |
|---|---|---|---|
| predictionCol | Double | prediction | Predicted label |
| rawPredictionCol | Vector | rawPrediction | Vector whose length equals the number of classes, holding the counts of training instance labels at the tree node that makes the prediction |
| probabilityCol | Vector | probability | Vector whose length equals the number of classes, equal to rawPrediction normalized to a multinomial distribution |
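The relationship between rawPrediction and probability can be illustrated with plain arrays. This is a minimal sketch with hypothetical names (`ProbabilitySketch`, `normalize`, `argmax`); it assumes that no thresholds are set, so the predicted label is simply the index of the largest entry, and it is not the library's implementation.

```scala
object ProbabilitySketch {
  /** Divides each per-class count by the total so that the entries sum to 1. */
  def normalize(rawPrediction: Array[Double]): Array[Double] = {
    val total = rawPrediction.sum
    // Guard against an all-zero vector to avoid dividing by zero.
    if (total == 0.0) rawPrediction.clone() else rawPrediction.map(_ / total)
  }

  /** Index of the largest entry, returned as a Double class label. */
  def argmax(values: Array[Double]): Double =
    values.indexOf(values.max).toDouble
}
```

For a leaf reached by three training rows of class 0 and one row of class 1, `normalize(Array(3.0, 1.0))` gives `Array(0.75, 0.25)` and `argmax` returns `0.0`.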
- Example
    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.DecisionTreeClassificationModel
    import org.apache.spark.ml.classification.DecisionTreeClassifier
    import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
    import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}

    // Load the data stored in LIBSVM format as a DataFrame.
    val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

    // Index labels, adding metadata to the label column.
    // Fit on whole dataset to include all labels in index.
    val labelIndexer = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("indexedLabel")
      .fit(data)

    // Automatically identify categorical features, and index them.
    val featureIndexer = new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexedFeatures")
      .setMaxCategories(4) // features with > 4 distinct values are treated as continuous.
      .fit(data)

    // Split the data into training and test sets (30% held out for testing).
    val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

    // Train a DecisionTree model.
    val dt = new DecisionTreeClassifier()
      .setLabelCol("indexedLabel")
      .setFeaturesCol("indexedFeatures")

    // Convert indexed labels back to original labels.
    val labelConverter = new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("predictedLabel")
      .setLabels(labelIndexer.labels)

    // Chain indexers and tree in a Pipeline.
    val pipeline = new Pipeline()
      .setStages(Array(labelIndexer, featureIndexer, dt, labelConverter))

    // Train model. This also runs the indexers.
    val model = pipeline.fit(trainingData)

    // Make predictions.
    val predictions = model.transform(testData)

    // Select example rows to display.
    predictions.select("predictedLabel", "label", "features").show(5)

    // Select (prediction, true label) and compute test error.
    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("indexedLabel")
      .setPredictionCol("prediction")
      .setMetricName("accuracy")
    val accuracy = evaluator.evaluate(predictions)
    println(s"Test Error = ${(1.0 - accuracy)}")

    val treeModel = model.stages(2).asInstanceOf[DecisionTreeClassificationModel]
    println(s"Learned classification tree model:\n ${treeModel.toDebugString}")
- Result
    +--------------+-----+--------------------+
    |predictedLabel|label|            features|
    +--------------+-----+--------------------+
    |           1.0|  1.0|(47236,[270,439,5...|
    |           1.0|  1.0|(47236,[3023,6093...|
    |          -1.0| -1.0|(47236,[270,391,4...|
    |          -1.0| -1.0|(47236,[3718,3723...|
    |           1.0|  1.0|(47236,[729,760,1...|
    +--------------+-----+--------------------+
    only showing top 5 rows

    Test Error = 0.06476632743800015
ML Regression API
- Function description
Input sample data in Dataset format and call the training API to output the Decision Tree regression model.
- Input and output
- Package name: package org.apache.spark.ml.regression
- Class name: DecisionTreeRegressor
- Method name: fit
- Input: training sample data (Dataset[_]). The following are mandatory fields.
| Parameter | Value Type | Default Value | Description |
|---|---|---|---|
| labelCol | Double | label | Label to predict |
| featuresCol | Vector | features | Feature vector |
- Parameters optimized based on native algorithms
  - def setCheckpointInterval(value: Int): DecisionTreeRegressor.this.type (specifies how often to checkpoint the cached node IDs)
  - def setFeaturesCol(value: String): DecisionTreeRegressor
  - def setImpurity(value: String): DecisionTreeRegressor.this.type
  - def setLabelCol(value: String): DecisionTreeRegressor
  - def setMaxBins(value: Int): DecisionTreeRegressor.this.type
  - def setMaxDepth(value: Int): DecisionTreeRegressor.this.type
  - def setMinInfoGain(value: Double): DecisionTreeRegressor.this.type
  - def setMinInstancesPerNode(value: Int): DecisionTreeRegressor.this.type
  - def setPredictionCol(value: String): DecisionTreeRegressor
  - def setSeed(value: Long): DecisionTreeRegressor.this.type
  - def setVarianceCol(value: String): DecisionTreeRegressor.this.type
- Newly added parameters
| Parameter | Description | Value Type |
|---|---|---|
| numTrainingDataCopies | Number of training data copies | Integer type. The value must be greater than or equal to 1. The default value is 1. |
| broadcastVariables | Whether to broadcast variables with large storage space | Boolean type. The default value is false. |
| numPartsPerTrainingDataCopy | Number of partitions of a single training data copy | Integer type. The value must be greater than or equal to 0. The default value is 0, indicating that re-partitioning is not performed. |
| binnedFeaturesDataType | Storage format of features in training sample data | String type. The value can be array (default) or fasthashmap. |
| copyStrategy | Selection of the copy allocation policy | String type. The value can be normal (default) or plus. |
| numFeaturesOptFindSplits | Dimension threshold for enabling optimization on searching the high-dimensional feature split point | Integer type. The default value is 8196. |
An example is provided as follows:
    import org.apache.spark.ml.param.{ParamMap, ParamPair}
    import org.apache.spark.ml.regression.DecisionTreeRegressor

    // maxDepth, maxDepth1, maxDepth2, maxBins, and trainingData are placeholders supplied by the caller.
    val dt = new DecisionTreeRegressor() // Definition

    // Define the def fit(dataset: Dataset[_], paramMap: ParamMap) API parameter.
    val paramMap = ParamMap(dt.maxDepth -> maxDepth).put(dt.maxBins, maxBins)

    // Define the def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]) API parameter.
    val paramMaps = new Array[ParamMap](2)
    for (i <- 0 until paramMaps.length) { // Assign a value to each element of paramMaps.
      paramMaps(i) = ParamMap(dt.maxDepth -> maxDepth)
        .put(dt.maxBins, maxBins)
    }

    // Define the def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*) API parameter.
    val firstParamPair = ParamPair(dt.maxDepth, maxDepth1)
    val otherParamPairs_1st = ParamPair(dt.maxDepth, maxDepth2)
    val otherParamPairs_2nd = ParamPair(dt.maxBins, maxBins)

    // Call the fit APIs.
    val model1 = dt.fit(trainingData)
    val model2 = dt.fit(trainingData, paramMap)
    val models = dt.fit(trainingData, paramMaps)
    val model3 = dt.fit(trainingData, firstParamPair, otherParamPairs_1st, otherParamPairs_2nd)
- Output: Decision Tree regression model (DecisionTreeRegressionModel). The following table lists the fields output in model prediction.
| Parameter | Value Type | Default Value | Description |
|---|---|---|---|
| predictionCol | Double | prediction | Predicted label |
| varianceCol | Double | - | The biased sample variance of prediction |
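To make "biased sample variance" concrete: the biased estimator divides the sum of squared deviations by n rather than by n - 1. The sketch below uses a hypothetical helper (`VarianceSketch.biasedVariance`) and assumes the variance is taken over the training labels that reach the predicting leaf; it is not the library's implementation.

```scala
object VarianceSketch {
  /** Biased sample variance: mean squared deviation from the mean (divides by n). */
  def biasedVariance(labels: Seq[Double]): Double = {
    require(labels.nonEmpty, "at least one label is required")
    val mean = labels.sum / labels.size
    labels.map(y => (y - mean) * (y - mean)).sum / labels.size
  }
}
```

For labels 1.0 and 3.0 the mean is 2.0 and the biased variance is 1.0, whereas the unbiased estimator (dividing by n - 1) would give 2.0.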
- Example
    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.evaluation.RegressionEvaluator
    import org.apache.spark.ml.feature.VectorIndexer
    import org.apache.spark.ml.regression.DecisionTreeRegressionModel
    import org.apache.spark.ml.regression.DecisionTreeRegressor

    // Load the data stored in LIBSVM format as a DataFrame.
    val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

    // Automatically identify categorical features, and index them.
    // Here, we treat features with > 4 distinct values as continuous.
    val featureIndexer = new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexedFeatures")
      .setMaxCategories(4)
      .fit(data)

    // Split the data into training and test sets (30% held out for testing).
    val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

    // Train a DecisionTree model.
    val dt = new DecisionTreeRegressor()
      .setLabelCol("label")
      .setFeaturesCol("indexedFeatures")

    // Chain indexer and tree in a Pipeline.
    val pipeline = new Pipeline()
      .setStages(Array(featureIndexer, dt))

    // Train model. This also runs the indexer.
    val model = pipeline.fit(trainingData)

    // Make predictions.
    val predictions = model.transform(testData)

    // Select example rows to display.
    predictions.select("prediction", "label", "features").show(5)

    // Select (prediction, true label) and compute test error.
    val evaluator = new RegressionEvaluator()
      .setLabelCol("label")
      .setPredictionCol("prediction")
      .setMetricName("rmse")
    val rmse = evaluator.evaluate(predictions)
    println(s"Root Mean Squared Error (RMSE) on test data = $rmse")

    val treeModel = model.stages(1).asInstanceOf[DecisionTreeRegressionModel]
    println(s"Learned regression tree model:\n ${treeModel.toDebugString}")
- Result
    +----------+-----+--------------------+
    |prediction|label|            features|
    +----------+-----+--------------------+
    |      0.51|  0.3|(1000,[0,1,2,3,4,...|
    +----------+-----+--------------------+

    Root Mean Squared Error (RMSE) on test data = 0.21000000000000002
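The RMSE reported above is the square root of the mean squared difference between predictions and labels. As a rough sketch of that calculation on plain sequences (the helper `RmseSketch.rmse` is hypothetical and stands in for what the `rmse` metric computes over a DataFrame):

```scala
object RmseSketch {
  /** Root mean squared error between paired predictions and labels. */
  def rmse(predictions: Seq[Double], labels: Seq[Double]): Double = {
    require(predictions.nonEmpty && predictions.size == labels.size,
      "predictions and labels must be non-empty and of equal length")
    val squaredErrors = predictions.zip(labels).map { case (p, y) => (p - y) * (p - y) }
    math.sqrt(squaredErrors.sum / squaredErrors.size)
  }
}
```

With predictions 1.0 and 5.0 against labels 4.0 and 2.0, both errors have magnitude 3, so the RMSE is 3.0.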