Random Forest
The Random Forest algorithm provides two types of model APIs: ML Classification API and ML Regression API.
| Model API Type | Function API |
|---|---|
| ML Classification API | `def fit(dataset: Dataset[_]): RandomForestClassificationModel`<br>`def fit(dataset: Dataset[_], paramMap: ParamMap): RandomForestClassificationModel`<br>`def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[RandomForestClassificationModel]`<br>`def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): RandomForestClassificationModel` |
| ML Regression API | `def fit(dataset: Dataset[_]): RandomForestRegressionModel`<br>`def fit(dataset: Dataset[_], paramMap: ParamMap): RandomForestRegressionModel`<br>`def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[RandomForestRegressionModel]`<br>`def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): RandomForestRegressionModel` |
ML Classification API
- Function description
Trains a Random Forest classification model: pass sample data in Dataset format to the training (fit) API, which returns the fitted classification model.
- Input and output
- Package name: package org.apache.spark.ml.classification
- Class name: RandomForestClassifier
- Method name: fit
- Input: training sample data (Dataset[_]). The following fields are mandatory.

| Parameter | Value Type | Default Value | Description |
|---|---|---|---|
| labelCol | Double | label | Training label |
| featuresCol | Vector | features | Feature vector |
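As a sketch of the mandatory input schema, the following shows one way to assemble a training Dataset with a Double label column and a Vector features column. The column names match the defaults above; the values and the application name are purely illustrative.

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rf-input-sketch").getOrCreate()
import spark.implicits._

// Each row is (label: Double, features: Vector); "label" and "features"
// are the default column names expected by RandomForestClassifier.
val training = Seq(
  (0.0, Vectors.dense(1.0, 0.5)),
  (1.0, Vectors.dense(0.2, 1.3))
).toDF("label", "features")
```

In practice the data is usually loaded from a source such as a LIBSVM file (as in the examples below), which produces the same two-column layout.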
- Input: the fit API model parameters paramMap, paramMaps, firstParamPair, and otherParamPairs, described as follows:

| Parameter | Value Type | Example | Description |
|---|---|---|---|
| paramMap | ParamMap | ParamMap(A.c -> b) | Assigns the value b to the parameter c of model A. |
| paramMaps | Array[ParamMap] | Array[ParamMap](n) | An array of n ParamMaps; the fit API trains one model per ParamMap. |
| firstParamPair | ParamPair | ParamPair(A.c, b) | Assigns the value b to the parameter c of model A. |
| otherParamPairs | ParamPair | ParamPair(A.e, f) | Assigns the value f to the parameter e of model A. |
- Parameters optimized based on native algorithms
```scala
def setCheckpointInterval(value: Int): RandomForestClassifier.this.type
def setFeatureSubsetStrategy(value: String): RandomForestClassifier.this.type
def setFeaturesCol(value: String): RandomForestClassifier
def setImpurity(value: String): RandomForestClassifier.this.type
def setLabelCol(value: String): RandomForestClassifier
def setMaxBins(value: Int): RandomForestClassifier.this.type
def setMaxDepth(value: Int): RandomForestClassifier.this.type
def setMinInfoGain(value: Double): RandomForestClassifier.this.type
def setMinInstancesPerNode(value: Int): RandomForestClassifier.this.type
def setNumTrees(value: Int): RandomForestClassifier.this.type
def setPredictionCol(value: String): RandomForestClassifier
def setProbabilityCol(value: String): RandomForestClassifier
def setRawPredictionCol(value: String): RandomForestClassifier
def setSeed(value: Long): RandomForestClassifier.this.type
def setSubsamplingRate(value: Double): RandomForestClassifier.this.type
def setThresholds(value: Array[Double]): RandomForestClassifier
```
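These setters follow the standard builder pattern, so they can be chained when configuring the classifier. A brief sketch (the parameter values are illustrative, not recommendations):

```scala
import org.apache.spark.ml.classification.RandomForestClassifier

val rf = new RandomForestClassifier()
  .setNumTrees(20)          // number of trees in the forest
  .setMaxDepth(5)           // maximum depth of each tree
  .setMaxBins(32)           // number of bins used when discretizing continuous features
  .setSubsamplingRate(0.8)  // fraction of the training data sampled for each tree
  .setSeed(42L)             // fixed seed for reproducible training
```

The setters returning `RandomForestClassifier.this.type` and those returning `RandomForestClassifier` can be mixed freely in a chain; both return the classifier itself.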
- Newly added parameters
| Parameter | spark conf Parameter | Description | Value Type |
|---|---|---|---|
| numTrainingDataCopies | spark.boostkit.ml.rf.numTrainingDataCopies | Number of training data copies | Integer. The value must be greater than or equal to 1. The default value is 1. |
| broadcastVariables | spark.boostkit.ml.rf.broadcastVariables | Whether to broadcast variables with large storage space | Boolean. The default value is false. |
| numPartsPerTrainingDataCopy | spark.boostkit.ml.rf.numPartsPerTrainingDataCopy | Number of partitions of a single training data copy | Integer. The value must be greater than or equal to 0. The default value is 0, indicating that re-partitioning is not performed. |
| binnedFeaturesDataType | spark.boostkit.ml.rf.binnedFeaturesDataType | Storage format of features in training sample data | String. The supported storage formats are array (default) and fasthashmap. |
An example is provided as follows:

```scala
import org.apache.spark.ml.param.{ParamMap, ParamPair}

val rf = new RandomForestClassifier() // Define the random forest classifier.

// Define the def fit(dataset: Dataset[_], paramMap: ParamMap) API parameter.
val paramMap = ParamMap(rf.maxDepth -> maxDepth).put(rf.numTrees, numTrees)

// Define the def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]) API parameter.
val paramMaps = new Array[ParamMap](2)
for (i <- 0 until paramMaps.size) {
  paramMaps(i) = ParamMap(rf.maxDepth -> maxDepth)
    .put(rf.numTrees, numTrees)
} // Assign a value to each element of paramMaps.

// Define the def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*) API parameters.
val firstParamPair = ParamPair(rf.maxDepth, maxDepth1)
val otherParamPairs_1st = ParamPair(rf.maxDepth, maxDepth2)
val otherParamPairs_2nd = ParamPair(rf.numTrees, numTrees)

// Call the fit APIs.
model = rf.fit(trainingData)
model = rf.fit(trainingData, paramMap)
models = rf.fit(trainingData, paramMaps)
model = rf.fit(trainingData, firstParamPair, otherParamPairs_1st, otherParamPairs_2nd)
```

- Example code for setting the new parameters
```scala
// spark: SparkSession
spark.conf.set("spark.boostkit.ml.rf.binnedFeaturesDataType", binnedFeaturesType)
spark.conf.set("spark.boostkit.ml.rf.numTrainingDataCopies", numCopiesInput)
spark.conf.set("spark.boostkit.ml.rf.numPartsPerTrainingDataCopy", pt)
spark.conf.set("spark.boostkit.ml.rf.broadcastVariables", bcVariables)
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
```

- Output: Random Forest classification model (RFClassificationModel, or Seq[RFClassificationModel] when paramMaps is passed). The following table lists the field output in model prediction.
| Parameter | Value Type | Default Value | Description |
|---|---|---|---|
| predictionCol | Double | prediction | Predicted label |
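Besides predictionCol, the classification model also emits the raw-prediction and probability columns configured by setRawPredictionCol and setProbabilityCol listed above. A brief sketch of reading them, assuming a fitted model `model` and a test Dataset `testData`:

```scala
// transform() appends the prediction columns to the input Dataset.
val predictions = model.transform(testData)

// "probability": per-class probability vector;
// "rawPrediction": per-class raw vote counts from the forest.
predictions.select("prediction", "probability", "rawPrediction").show(5)
```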
- Example: fit(dataset: Dataset[_]): RFClassificationModel
```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}

// Load and parse the data file, converting it to a DataFrame.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

// Index labels, adding metadata to the label column.
// Fit on whole dataset to include all labels in index.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(data)

// Automatically identify categorical features, and index them.
// Set maxCategories so features with > 4 distinct values are treated as continuous.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(data)

// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

// Train a RandomForest model.
val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setNumTrees(10)

// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

// Chain indexers and forest in a Pipeline.
val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, featureIndexer, rf, labelConverter))

// Train model. This also runs the indexers.
val model = pipeline.fit(trainingData)

// Make predictions.
val predictions = model.transform(testData)

// Select example rows to display.
predictions.select("predictedLabel", "label", "features").show(5)

// Select (prediction, true label) and compute test error.
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("indexedLabel")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println(s"Test Error = ${(1.0 - accuracy)}")

val rfModel = model.stages(2).asInstanceOf[RandomForestClassificationModel]
println(s"Learned classification forest model:\n ${rfModel.toDebugString}")
```
- Result
```
+--------------+-----+--------------------+
|predictedLabel|label|            features|
+--------------+-----+--------------------+
|           1.0|  1.0|(47236,[270,439,5...|
|           1.0|  1.0|(47236,[3023,6093...|
|          -1.0| -1.0|(47236,[270,391,4...|
|          -1.0| -1.0|(47236,[3718,3723...|
|           1.0|  1.0|(47236,[729,760,1...|
+--------------+-----+--------------------+
only showing top 5 rows

Test Error = 0.06476632743800015
```
ML Regression API
- Input and output
- Package name: package org.apache.spark.ml.regression
- Class name: RandomForestRegressor
- Method name: fit
- Input: training sample data (Dataset[_]). The following fields are mandatory.

| Parameter | Value Type | Default Value | Description |
|---|---|---|---|
| labelCol | Double | label | Training label |
| featuresCol | Vector | features | Feature vector |
- Input: the fit API model parameters paramMap, paramMaps, firstParamPair, and otherParamPairs, described as follows:

| Parameter | Value Type | Example | Description |
|---|---|---|---|
| paramMap | ParamMap | ParamMap(A.c -> b) | Assigns the value b to the parameter c of model A. |
| paramMaps | Array[ParamMap] | Array[ParamMap](n) | An array of n ParamMaps; the fit API trains one model per ParamMap. |
| firstParamPair | ParamPair | ParamPair(A.c, b) | Assigns the value b to the parameter c of model A. |
| otherParamPairs | ParamPair | ParamPair(A.e, f) | Assigns the value f to the parameter e of model A. |
- Parameters optimized based on native algorithms
```scala
def setCheckpointInterval(value: Int): RandomForestRegressor.this.type
def setFeatureSubsetStrategy(value: String): RandomForestRegressor.this.type
def setFeaturesCol(value: String): RandomForestRegressor
def setImpurity(value: String): RandomForestRegressor.this.type
def setLabelCol(value: String): RandomForestRegressor
def setMaxBins(value: Int): RandomForestRegressor.this.type
def setMaxDepth(value: Int): RandomForestRegressor.this.type
def setMinInfoGain(value: Double): RandomForestRegressor.this.type
def setMinInstancesPerNode(value: Int): RandomForestRegressor.this.type
def setNumTrees(value: Int): RandomForestRegressor.this.type
def setPredictionCol(value: String): RandomForestRegressor
def setSeed(value: Long): RandomForestRegressor.this.type
def setSubsamplingRate(value: Double): RandomForestRegressor.this.type
```
- Newly added parameters
| Parameter | spark conf Parameter | Description | Value Type |
|---|---|---|---|
| numTrainingDataCopies | spark.boostkit.ml.rf.numTrainingDataCopies | Number of training data copies | Integer. The value must be greater than or equal to 1. The default value is 1. |
| broadcastVariables | spark.boostkit.ml.rf.broadcastVariables | Whether to broadcast variables with large storage space | Boolean. The default value is false. |
| numPartsPerTrainingDataCopy | spark.boostkit.ml.rf.numPartsPerTrainingDataCopy | Number of partitions of a single training data copy | Integer. The value must be greater than or equal to 0. The default value is 0, indicating that re-partitioning is not performed. |
| binnedFeaturesDataType | spark.boostkit.ml.rf.binnedFeaturesDataType | Storage format of features in training sample data | String. The value can be array (default) or fasthashmap. |
An example is provided as follows:

```scala
import org.apache.spark.ml.param.{ParamMap, ParamPair}

val rf = new RandomForestRegressor() // Define the regression model.

// Define the def fit(dataset: Dataset[_], paramMap: ParamMap) API parameter.
val paramMap = ParamMap(rf.maxDepth -> maxDepth)
  .put(rf.numTrees, numTrees)

// Define the def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]) API parameter.
val paramMaps = new Array[ParamMap](2)
for (i <- 0 until paramMaps.size) {
  paramMaps(i) = ParamMap(rf.maxDepth -> maxDepth)
    .put(rf.numTrees, numTrees)
} // Assign a value to each element of paramMaps.

// Define the def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*) API parameters.
val firstParamPair = ParamPair(rf.maxDepth, maxDepth1)
val otherParamPairs_1st = ParamPair(rf.maxDepth, maxDepth2)
val otherParamPairs_2nd = ParamPair(rf.numTrees, numTrees)

// Call the fit APIs.
model = rf.fit(trainingData)
model = rf.fit(trainingData, paramMap)
models = rf.fit(trainingData, paramMaps)
model = rf.fit(trainingData, firstParamPair, otherParamPairs_1st, otherParamPairs_2nd)
```
- Example code for setting the new parameters
```scala
// spark: SparkSession
spark.conf.set("spark.boostkit.ml.rf.binnedFeaturesDataType", binnedFeaturesType)
spark.conf.set("spark.boostkit.ml.rf.numTrainingDataCopies", numCopiesInput)
spark.conf.set("spark.boostkit.ml.rf.numPartsPerTrainingDataCopy", pt)
spark.conf.set("spark.boostkit.ml.rf.broadcastVariables", bcVariables)
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
```

- Output: Random Forest regression model (RFRegressionModel or Seq[RFRegressionModel]). The following table lists the field output in model prediction.
| Parameter | Value Type | Default Value | Description |
|---|---|---|---|
| predictionCol | Double | prediction | Predicted label |
- Example: fit(dataset: Dataset[_]): RFRegressionModel
```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.regression.{RandomForestRegressionModel, RandomForestRegressor}

// Load and parse the data file, converting it to a DataFrame.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

// Automatically identify categorical features, and index them.
// Set maxCategories so features with > 4 distinct values are treated as continuous.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(data)

// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

// Train a RandomForest model.
val rf = new RandomForestRegressor()
  .setLabelCol("label")
  .setFeaturesCol("indexedFeatures")

// Chain indexer and forest in a Pipeline.
val pipeline = new Pipeline()
  .setStages(Array(featureIndexer, rf))

// Train model. This also runs the indexer.
val model = pipeline.fit(trainingData)

// Make predictions.
val predictions = model.transform(testData)

// Select example rows to display.
predictions.select("prediction", "label", "features").show(5)

// Select (prediction, true label) and compute test error.
val evaluator = new RegressionEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("rmse")
val rmse = evaluator.evaluate(predictions)
println(s"Root Mean Squared Error (RMSE) on test data = $rmse")

val rfModel = model.stages(1).asInstanceOf[RandomForestRegressionModel]
println(s"Learned regression forest model:\n ${rfModel.toDebugString}")
```
- Result
```
+----------+-----+--------------------+
|prediction|label|            features|
+----------+-----+--------------------+
|      0.51|  0.3|(1000,[0,1,2,3,4,...|
+----------+-----+--------------------+

root mean squared error = 0.21000000000000002
```