Random Forest
The Random Forest algorithm provides two types of model APIs: ML Classification API and ML Regression API.
| Model API Type | Function API |
|---|---|
| ML Classification API | `def fit(dataset: Dataset[_]): RandomForestClassificationModel`<br>`def fit(dataset: Dataset[_], paramMap: ParamMap): RandomForestClassificationModel`<br>`def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[RandomForestClassificationModel]`<br>`def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): RandomForestClassificationModel` |
| ML Regression API | `def fit(dataset: Dataset[_]): RandomForestRegressionModel`<br>`def fit(dataset: Dataset[_], paramMap: ParamMap): RandomForestRegressionModel`<br>`def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[RandomForestRegressionModel]`<br>`def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): RandomForestRegressionModel` |
ML Classification API
- Function description
Trains a Random Forest classification model: pass sample data in Dataset format to the training (fit) API, which returns the fitted classification model.
- Input and output
- Package name: package org.apache.spark.ml.classification
- Class name: RandomForestClassifier
- Method name: fit
- Input: training sample data (Dataset[_]). The following fields are mandatory.

| Parameter | Value Type | Default Value | Description |
|---|---|---|---|
| labelCol | Double | label | Training label |
| featuresCol | Vector | features | Feature vector |
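As a sketch of the mandatory input schema, the following shows one way to assemble a training Dataset with a Double label column and a Vector features column. The column names match the defaults above; the values and the application name are purely illustrative.

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rf-input-sketch").getOrCreate()
import spark.implicits._

// Each row is (label: Double, features: Vector); "label" and "features"
// are the default column names expected by RandomForestClassifier.
val training = Seq(
  (0.0, Vectors.dense(1.0, 0.5)),
  (1.0, Vectors.dense(0.2, 1.3))
).toDF("label", "features")
```

In practice the data is usually loaded from a source such as a LIBSVM file (as in the examples below), which produces the same two-column layout.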
- Input: the fit API model parameters paramMap, paramMaps, firstParamPair, and otherParamPairs, described as follows:

| Parameter | Value Type | Example | Description |
|---|---|---|---|
| paramMap | ParamMap | ParamMap(A.c -> b) | Assigns the value b to the parameter c of model A. |
| paramMaps | Array[ParamMap] | Array[ParamMap](n) | An array of n ParamMaps; the fit API trains one model per ParamMap. |
| firstParamPair | ParamPair | ParamPair(A.c, b) | Assigns the value b to the parameter c of model A. |
| otherParamPairs | ParamPair | ParamPair(A.e, f) | Assigns the value f to the parameter e of model A. |
- Parameters optimized based on native algorithms
```scala
def setCheckpointInterval(value: Int): RandomForestClassifier.this.type
def setFeatureSubsetStrategy(value: String): RandomForestClassifier.this.type
def setFeaturesCol(value: String): RandomForestClassifier
def setImpurity(value: String): RandomForestClassifier.this.type
def setLabelCol(value: String): RandomForestClassifier
def setMaxBins(value: Int): RandomForestClassifier.this.type
def setMaxDepth(value: Int): RandomForestClassifier.this.type
def setMinInfoGain(value: Double): RandomForestClassifier.this.type
def setMinInstancesPerNode(value: Int): RandomForestClassifier.this.type
def setNumTrees(value: Int): RandomForestClassifier.this.type
def setPredictionCol(value: String): RandomForestClassifier
def setProbabilityCol(value: String): RandomForestClassifier
def setRawPredictionCol(value: String): RandomForestClassifier
def setSeed(value: Long): RandomForestClassifier.this.type
def setSubsamplingRate(value: Double): RandomForestClassifier.this.type
def setThresholds(value: Array[Double]): RandomForestClassifier
```
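These setters follow the standard builder pattern, so they can be chained when configuring the classifier. A brief sketch (the parameter values are illustrative, not recommendations):

```scala
import org.apache.spark.ml.classification.RandomForestClassifier

val rf = new RandomForestClassifier()
  .setNumTrees(20)          // number of trees in the forest
  .setMaxDepth(5)           // maximum depth of each tree
  .setMaxBins(32)           // number of bins used when discretizing continuous features
  .setSubsamplingRate(0.8)  // fraction of the training data sampled for each tree
  .setSeed(42L)             // fixed seed for reproducible training
```

The setters returning `RandomForestClassifier.this.type` and those returning `RandomForestClassifier` can be mixed freely in a chain; both return the classifier itself.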
- Newly added parameters
| Parameter | spark conf Parameter | Description | Value Type |
|---|---|---|---|
| numTrainingDataCopies | spark.boostkit.ml.rf.numTrainingDataCopies | Number of training data copies | Integer. The value must be greater than or equal to 1. The default value is 1. |
| broadcastVariables | spark.boostkit.ml.rf.broadcastVariables | Whether to broadcast variables with large storage space | Boolean. The default value is false. |
| numPartsPerTrainingDataCopy | spark.boostkit.ml.rf.numPartsPerTrainingDataCopy | Number of partitions of a single training data copy | Integer. The value must be greater than or equal to 0. The default value is 0, indicating that re-partitioning is not performed. |
| binnedFeaturesDataType | spark.boostkit.ml.rf.binnedFeaturesDataType | Storage format of features in training sample data | String. The supported storage formats are array (default) and fasthashmap. |
An example is provided as follows:

```scala
import org.apache.spark.ml.param.{ParamMap, ParamPair}

val rf = new RandomForestClassifier() // Define the random forest classifier.

// Define the def fit(dataset: Dataset[_], paramMap: ParamMap) API parameter.
val paramMap = ParamMap(rf.maxDepth -> maxDepth).put(rf.numTrees, numTrees)

// Define the def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]) API parameter.
val paramMaps = new Array[ParamMap](2)
for (i <- 0 until paramMaps.size) {
  paramMaps(i) = ParamMap(rf.maxDepth -> maxDepth)
    .put(rf.numTrees, numTrees)
} // Assign a value to each element of paramMaps.

// Define the def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*) API parameters.
val firstParamPair = ParamPair(rf.maxDepth, maxDepth1)
val otherParamPairs_1st = ParamPair(rf.maxDepth, maxDepth2)
val otherParamPairs_2nd = ParamPair(rf.numTrees, numTrees)

// Call the fit APIs.
model = rf.fit(trainingData)
model = rf.fit(trainingData, paramMap)
models = rf.fit(trainingData, paramMaps)
model = rf.fit(trainingData, firstParamPair, otherParamPairs_1st, otherParamPairs_2nd)
```

- Example code for setting the new parameters
```scala
// spark: SparkSession
spark.conf.set("spark.boostkit.ml.rf.binnedFeaturesDataType", binnedFeaturesType)
spark.conf.set("spark.boostkit.ml.rf.numTrainingDataCopies", numCopiesInput)
spark.conf.set("spark.boostkit.ml.rf.numPartsPerTrainingDataCopy", pt)
spark.conf.set("spark.boostkit.ml.rf.broadcastVariables", bcVariables)
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
```

- Output: Random Forest classification model (RFClassificationModel, or Seq[RFClassificationModel] when paramMaps is passed). The following table lists the field output in model prediction.
| Parameter | Value Type | Default Value | Description |
|---|---|---|---|
| predictionCol | Double | prediction | Predicted label |
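Besides predictionCol, the classification model also emits the raw-prediction and probability columns configured by setRawPredictionCol and setProbabilityCol listed above. A brief sketch of reading them, assuming a fitted model `model` and a test Dataset `testData`:

```scala
// transform() appends the prediction columns to the input Dataset.
val predictions = model.transform(testData)

// "probability": per-class probability vector;
// "rawPrediction": per-class raw vote counts from the forest.
predictions.select("prediction", "probability", "rawPrediction").show(5)
```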
- Example: fit(dataset: Dataset[_]): RFClassificationModel
```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}

// Load and parse the data file, converting it to a DataFrame.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

// Index labels, adding metadata to the label column.
// Fit on whole dataset to include all labels in index.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(data)

// Automatically identify categorical features, and index them.
// Set maxCategories so features with > 4 distinct values are treated as continuous.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(data)

// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

// Train a RandomForest model.
val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setNumTrees(10)

// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

// Chain indexers and forest in a Pipeline.
val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, featureIndexer, rf, labelConverter))

// Train model. This also runs the indexers.
val model = pipeline.fit(trainingData)

// Make predictions.
val predictions = model.transform(testData)

// Select example rows to display.
predictions.select("predictedLabel", "label", "features").show(5)

// Select (prediction, true label) and compute test error.
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("indexedLabel")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println(s"Test Error = ${(1.0 - accuracy)}")

val rfModel = model.stages(2).asInstanceOf[RandomForestClassificationModel]
println(s"Learned classification forest model:\n ${rfModel.toDebugString}")
```
- Result
```
+--------------+-----+--------------------+
|predictedLabel|label|            features|
+--------------+-----+--------------------+
|           1.0|  1.0|(47236,[270,439,5...|
|           1.0|  1.0|(47236,[3023,6093...|
|          -1.0| -1.0|(47236,[270,391,4...|
|          -1.0| -1.0|(47236,[3718,3723...|
|           1.0|  1.0|(47236,[729,760,1...|
+--------------+-----+--------------------+
only showing top 5 rows

Test Error = 0.06476632743800015
```
ML Regression API
- Input and output
- Package name: package org.apache.spark.ml.regression
- Class name: RandomForestRegressor
- Method name: fit
- Input: training sample data (Dataset[_]). The following fields are mandatory.

| Parameter | Value Type | Default Value | Description |
|---|---|---|---|
| labelCol | Double | label | Training label |
| featuresCol | Vector | features | Feature vector |
- Input: the fit API model parameters paramMap, paramMaps, firstParamPair, and otherParamPairs, described as follows:

| Parameter | Value Type | Example | Description |
|---|---|---|---|
| paramMap | ParamMap | ParamMap(A.c -> b) | Assigns the value b to the parameter c of model A. |
| paramMaps | Array[ParamMap] | Array[ParamMap](n) | An array of n ParamMaps; the fit API trains one model per ParamMap. |
| firstParamPair | ParamPair | ParamPair(A.c, b) | Assigns the value b to the parameter c of model A. |
| otherParamPairs | ParamPair | ParamPair(A.e, f) | Assigns the value f to the parameter e of model A. |
- Parameters optimized based on native algorithms
```scala
def setCheckpointInterval(value: Int): RandomForestRegressor.this.type
def setFeatureSubsetStrategy(value: String): RandomForestRegressor.this.type
def setFeaturesCol(value: String): RandomForestRegressor
def setImpurity(value: String): RandomForestRegressor.this.type
def setLabelCol(value: String): RandomForestRegressor
def setMaxBins(value: Int): RandomForestRegressor.this.type
def setMaxDepth(value: Int): RandomForestRegressor.this.type
def setMinInfoGain(value: Double): RandomForestRegressor.this.type
def setMinInstancesPerNode(value: Int): RandomForestRegressor.this.type
def setNumTrees(value: Int): RandomForestRegressor.this.type
def setPredictionCol(value: String): RandomForestRegressor
def setSeed(value: Long): RandomForestRegressor.this.type
def setSubsamplingRate(value: Double): RandomForestRegressor.this.type
```
- Newly added parameters
| Parameter | spark conf Parameter | Description | Value Type |
|---|---|---|---|
| numTrainingDataCopies | spark.boostkit.ml.rf.numTrainingDataCopies | Number of training data copies | Integer. The value must be greater than or equal to 1. The default value is 1. |
| broadcastVariables | spark.boostkit.ml.rf.broadcastVariables | Whether to broadcast variables with large storage space | Boolean. The default value is false. |
| numPartsPerTrainingDataCopy | spark.boostkit.ml.rf.numPartsPerTrainingDataCopy | Number of partitions of a single training data copy | Integer. The value must be greater than or equal to 0. The default value is 0, indicating that re-partitioning is not performed. |
| binnedFeaturesDataType | spark.boostkit.ml.rf.binnedFeaturesDataType | Storage format of features in training sample data | String. The value can be array (default) or fasthashmap. |
An example is provided as follows:

```scala
import org.apache.spark.ml.param.{ParamMap, ParamPair}

val rf = new RandomForestRegressor() // Define the regression model.

// Define the def fit(dataset: Dataset[_], paramMap: ParamMap) API parameter.
val paramMap = ParamMap(rf.maxDepth -> maxDepth)
  .put(rf.numTrees, numTrees)

// Define the def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]) API parameter.
val paramMaps = new Array[ParamMap](2)
for (i <- 0 until paramMaps.size) {
  paramMaps(i) = ParamMap(rf.maxDepth -> maxDepth)
    .put(rf.numTrees, numTrees)
} // Assign a value to each element of paramMaps.

// Define the def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*) API parameters.
val firstParamPair = ParamPair(rf.maxDepth, maxDepth1)
val otherParamPairs_1st = ParamPair(rf.maxDepth, maxDepth2)
val otherParamPairs_2nd = ParamPair(rf.numTrees, numTrees)

// Call the fit APIs.
model = rf.fit(trainingData)
model = rf.fit(trainingData, paramMap)
models = rf.fit(trainingData, paramMaps)
model = rf.fit(trainingData, firstParamPair, otherParamPairs_1st, otherParamPairs_2nd)
```
- Example code for setting the new parameters
```scala
// spark: SparkSession
spark.conf.set("spark.boostkit.ml.rf.binnedFeaturesDataType", binnedFeaturesType)
spark.conf.set("spark.boostkit.ml.rf.numTrainingDataCopies", numCopiesInput)
spark.conf.set("spark.boostkit.ml.rf.numPartsPerTrainingDataCopy", pt)
spark.conf.set("spark.boostkit.ml.rf.broadcastVariables", bcVariables)
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
```

- Output: Random Forest regression model (RFRegressionModel or Seq[RFRegressionModel]). The following table lists the field output in model prediction.
| Parameter | Value Type | Default Value | Description |
|---|---|---|---|
| predictionCol | Double | prediction | Predicted label |
- Example: fit(dataset: Dataset[_]): RFRegressionModel
```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.regression.{RandomForestRegressionModel, RandomForestRegressor}

// Load and parse the data file, converting it to a DataFrame.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

// Automatically identify categorical features, and index them.
// Set maxCategories so features with > 4 distinct values are treated as continuous.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(data)

// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

// Train a RandomForest model.
val rf = new RandomForestRegressor()
  .setLabelCol("label")
  .setFeaturesCol("indexedFeatures")

// Chain indexer and forest in a Pipeline.
val pipeline = new Pipeline()
  .setStages(Array(featureIndexer, rf))

// Train model. This also runs the indexer.
val model = pipeline.fit(trainingData)

// Make predictions.
val predictions = model.transform(testData)

// Select example rows to display.
predictions.select("prediction", "label", "features").show(5)

// Select (prediction, true label) and compute test error.
val evaluator = new RegressionEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("rmse")
val rmse = evaluator.evaluate(predictions)
println(s"Root Mean Squared Error (RMSE) on test data = $rmse")

val rfModel = model.stages(1).asInstanceOf[RandomForestRegressionModel]
println(s"Learned regression forest model:\n ${rfModel.toDebugString}")
```
- Result
```
+----------+-----+--------------------+
|prediction|label|            features|
+----------+-----+--------------------+
|      0.51|  0.3|(1000,[0,1,2,3,4,...|
+----------+-----+--------------------+

root mean squared error = 0.21000000000000002
```