XGBoost
The XGBoost algorithm provides two model interfaces: the ML Classification API and the ML Regression API.
| Model API Type | Function API |
|---|---|
| ML Classification API | def fit(dataset: Dataset[_]): XGBoostClassificationModel |
| | def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[XGBoostClassificationModel] |
| | def fit(dataset: Dataset[_], paramMap: ParamMap): XGBoostClassificationModel |
| | def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): XGBoostClassificationModel |
| ML Regression API | def fit(dataset: Dataset[_]): XGBoostRegressionModel |
| | def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[XGBoostRegressionModel] |
| | def fit(dataset: Dataset[_], paramMap: ParamMap): XGBoostRegressionModel |
| | def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): XGBoostRegressionModel |
ML Classification API
- Input/Output
- Package name: ml.dmlc.xgboost4j.scala.spark
- Class name: XGBoostClassifier
- Method name: fit
- Input: training sample data (Dataset[_]). The following fields are mandatory:

| Parameter | Type | Default Value | Description |
|---|---|---|---|
| labelCol | Double | label | Training label |
| featuresCol | Vector | features | Feature vector |
- Output: XGBoostClassificationModel, an XGBoost classification model. The output fields during model prediction are as follows:

| Parameter | Type | Example | Description |
|---|---|---|---|
| predictionCol | Double | prediction | Predicted label value |
- Algorithm parameters (setter methods):
def setAllowNonZeroForMissing(value: Boolean): XGBoostClassifier.this.type
def setAlpha(value: Double): XGBoostClassifier.this.type
def setBaseMarginCol(value: String): XGBoostClassifier.this.type
def setBaseScore(value: Double): XGBoostClassifier.this.type
def setCheckpointInterval(value: Int): XGBoostClassifier.this.type
def setCheckpointPath(value: String): XGBoostClassifier.this.type
def setColsampleBylevel(value: Double): XGBoostClassifier.this.type
def setColsampleBytree(value: Double): XGBoostClassifier.this.type
def setCustomEval(value: EvalTrait): XGBoostClassifier.this.type
def setCustomObj(value: ObjectiveTrait): XGBoostClassifier.this.type
def setEta(value: Double): XGBoostClassifier.this.type
def setEvalMetric(value: String): XGBoostClassifier.this.type
def setEvalSets(evalSets: Map[String, DataFrame]): XGBoostClassifier.this.type
def setFeaturesCol(value: String): XGBoostClassifier
def setGamma(value: Double): XGBoostClassifier.this.type
def setGrowPolicy(value: String): XGBoostClassifier.this.type
def setLabelCol(value: String): XGBoostClassifier.this.type
def setLambda(value: Double): XGBoostClassifier.this.type
def setLambdaBias(value: Double): XGBoostClassifier.this.type
def setMaxBins(value: Int): XGBoostClassifier.this.type
def setMaxDeltaStep(value: Double): XGBoostClassifier.this.type
def setMaxDepth(value: Int): XGBoostClassifier.this.type
def setMaxLeaves(value: Int): XGBoostClassifier.this.type
def setMaximizeEvaluationMetrics(value: Boolean): XGBoostClassifier.this.type
def setMinChildWeight(value: Double): XGBoostClassifier.this.type
def setMissing(value: Float): XGBoostClassifier.this.type
def setNormalizeType(value: String): XGBoostClassifier.this.type
def setNthread(value: Int): XGBoostClassifier.this.type
def setNumClass(value: Int): XGBoostClassifier.this.type
def setNumEarlyStoppingRounds(value: Int): XGBoostClassifier.this.type
def setNumRound(value: Int): XGBoostClassifier.this.type
def setNumWorkers(value: Int): XGBoostClassifier.this.type
def setObjective(value: String): XGBoostClassifier.this.type
def setObjectiveType(value: String): XGBoostClassifier.this.type
def setPredictionCol(value: String): XGBoostClassifier
def setProbabilityCol(value: String): XGBoostClassifier
def setRateDrop(value: Double): XGBoostClassifier.this.type
def setRawPredictionCol(value: String): XGBoostClassifier.this.type
def setSampleType(value: String): XGBoostClassifier.this.type
def setScalePosWeight(value: Double): XGBoostClassifier.this.type
def setSeed(value: Long): XGBoostClassifier.this.type
def setSilent(value: Int): XGBoostClassifier.this.type
def setSinglePrecisionHistogram(value: Boolean): XGBoostClassifier.this.type
def setSketchEps(value: Double): XGBoostClassifier.this.type
def setSkipDrop(value: Double): XGBoostClassifier.this.type
def setSubsample(value: Double): XGBoostClassifier.this.type
def setThresholds(value: Array[Double]): XGBoostClassifier
def setTimeoutRequestWorkers(value: Long): XGBoostClassifier.this.type
def setTrainTestRatio(value: Double): XGBoostClassifier.this.type
def setTreeMethod(value: String): XGBoostClassifier.this.type
def setUseExternalMemory(value: Boolean): XGBoostClassifier.this.type
def setWeightCol(value: String): XGBoostClassifier.this.type
- Added algorithm parameters

| Parameter | Description | Type |
|---|---|---|
| grow_policy | Adds the depthwiselossltd option, which controls how new nodes are added to the tree. This parameter takes effect only when tree_method is set to hist. | String. Default value: depthwise. Options: depthwise, lossguide, depthwiselossltd. |
| min_loss_ratio | Controls the degree of tree-node pruning during training. Valid only when grow_policy is set to depthwiselossltd. | Double. Default value: 0. Value range: [0, 1). |
| sampling_strategy | Controls the sampling strategy used during training. | String. Default value: eachTree. Options: eachTree, eachIteration, alliteration, multiIteration, gossStyle. |
| enable_bbgen | Whether to use the batch Bernoulli bit-generation algorithm. | Boolean. The value can be true or false. Default value: false. |
| sampling_step | Controls the number of sampling rounds. Valid only when sampling_strategy is set to multiIteration. | Int. Default value: 1. Value range: [1, +∞). |
| auto_subsample | Whether to use the policy of automatically reducing the sampling rate. | Boolean. The value can be true or false. Default value: false. |
| auto_k | Controls the number of rounds in the automatic sampling-rate reduction policy. Valid only when auto_subsample is set to true. | Int. Default value: 1. Value range: [1, +∞). |
| auto_subsample_ratio | Sets the ratios used when automatically reducing the sampling rate. | Array[Double]. Default value: Array(0.05, 0.1, 0.2, 0.4, 0.8, 1.0). Value range: (0, 1]. |
| auto_r | Controls the error-rate increase allowed when the sampling rate is automatically reduced. | Double. Default value: 0.95. Value range: (0, 1]. |
| rabit_enable_tcp_no_delay | Controls the communication policy of the Rabit engine. | Boolean. The value can be true or false. Default value: false. |
| random_split_denom | Controls the proportion of candidate split points. | Int. Default value: 1. Value range: [1, +∞). |
| default_direction | Controls the default direction for missing values. | String. Options: left, right, learn. Default value: learn. |
Code interface example:

```scala
val xgbClassifier = new XGBoostClassifier(param)
  .setLabelCol("label")
  .setFeaturesCol("features")
val model = xgbClassifier.fit(train_data)
val predictions = model.transform(test_data)
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
```
- Example usage
```scala
package com.bigdata.ml

import java.io.File
import java.lang.System.nanoTime

import org.apache.spark.sql.{Dataset, Row, SparkSession}
import org.apache.spark.SparkConf
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.storage.StorageLevel
import scala.util.Random

import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier
import com.typesafe.config.{Config, ConfigFactory}

object Xgboost_test {
  def profile[R](code: => R, t: Long = nanoTime) = (code, nanoTime - t)

  def getSparkSession(): SparkSession = {
    val conf = new SparkConf().setAppName("XGBOOST-SPARK")
    val spark = SparkSession.builder().config(conf).getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")
    println("SparkSession created successfully!")
    spark
  }

  def main(args: Array[String]): Unit = {
    val config = ConfigFactory.parseFile(new File(args(0)))
    // set seed
    Random.setSeed(System.currentTimeMillis())
    // set session
    val spark = this.getSparkSession()
    println("created spark session")
    println(spark.sparkContext.getConf.toDebugString)
    val (result, time) = profile(test(spark, config))
    val time_sec = time.asInstanceOf[Long].toDouble * 1.0e-9
    println(s"Profiling complete in $time_sec seconds.")
  }

  def test(spark: SparkSession, config: Config): Unit = {
    var param = Map[String, Any]()
    val it = config.entrySet.iterator
    while (it.hasNext) {
      val entry = it.next
      param += (entry.getKey -> entry.getValue.unwrapped)
    }
    if (!config.hasPath("allow_non_zero_for_missing")) {
      param += ("allow_non_zero_for_missing" -> true)
    }
    println(param.mkString(";\n"))

    val xgbClassifier = new XGBoostClassifier(param)
      .setLabelCol("label")
      .setFeaturesCol("features")

    val time_point1 = System.currentTimeMillis()
    val train_data = getTrainData(spark, config).persist(StorageLevel.MEMORY_AND_DISK_SER)
    val time_point2 = System.currentTimeMillis()
    val model = xgbClassifier.fit(train_data)
    val time_point3 = System.currentTimeMillis()
    val test_data = getTestData(spark, config).persist(StorageLevel.MEMORY_AND_DISK_SER)
    val predictions = model.transform(test_data)
    val time_point4 = System.currentTimeMillis()

    val load_time = (time_point2 - time_point1) / 1000.0
    println(s"Loading complete in $load_time seconds.")
    val training_time = (time_point3 - time_point2) / 1000.0
    println(s"Training complete in $training_time seconds.")
    val testing_time = (time_point4 - time_point3) / 1000.0
    println(s"Testing complete in $testing_time seconds.")

    // Select (prediction, true label) and compute test error.
    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("label")
      .setPredictionCol("prediction")
      .setMetricName("accuracy")
    val accuracy = evaluator.evaluate(predictions)
    println(s"Test Error = ${1.0 - accuracy}")

    // Select example rows to display.
    predictions.select("prediction", "label", "features").show(5)
  }

  def getTrainData(spark: SparkSession, config: Config): Dataset[Row] = {
    val tr_fname = config.getString("tr_fname")
    println("tr_fname", tr_fname)
    var reader = spark.read
      .format("libsvm")
      .option("vectorType",
        if (config.hasPath("vectorType")) config.getString("vectorType") else "sparse")
    if (config.hasPath("numFeatures")) {
      val numFeatures = config.getInt("numFeatures")
      println("numFeatures", numFeatures)
      reader = reader.option("numFeatures", numFeatures)
    }
    reader.load(tr_fname)
  }

  def getTestData(spark: SparkSession, config: Config): Dataset[Row] = {
    val ts_fname = config.getString("ts_fname")
    println("ts_fname", ts_fname)
    var reader = spark.read
      .format("libsvm")
      .option("vectorType",
        if (config.hasPath("vectorType")) config.getString("vectorType") else "sparse")
    if (config.hasPath("numFeatures")) {
      val numFeatures = config.getInt("numFeatures")
      println("numFeatures", numFeatures)
      reader = reader.option("numFeatures", numFeatures)
    }
    reader.load(ts_fname)
  }
}
```

- Example result
```
Test Error = 0.253418287207109
+----------+-----+--------------------+
|prediction|label|            features|
+----------+-----+--------------------+
|       0.0|  0.0|(28,[0,1,2,3,4,5,...|
|       0.0|  0.0|(28,[0,1,2,3,4,5,...|
|       0.0|  0.0|(28,[0,1,2,3,4,5,...|
|       0.0|  0.0|(28,[0,1,2,3,4,5,...|
|       0.0|  0.0|(28,[0,1,2,3,4,5,...|
+----------+-----+--------------------+
only showing top 5 rows
```
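The driver in the example reads all of its settings from a Typesafe Config (HOCON) file passed as `args(0)`, copying every entry into the XGBoost parameter Map. Only `tr_fname`, `ts_fname`, `vectorType`, and `numFeatures` are read explicitly by the example code; the file paths and the remaining keys below (`num_round`, `num_workers`, `eta`, `max_depth`) are illustrative values we have chosen, not required settings:

```hocon
# Paths read by getTrainData/getTestData (hypothetical locations)
tr_fname = "/data/sample_train.libsvm"
ts_fname = "/data/sample_test.libsvm"

# Optional libsvm reader hints
numFeatures = 28
vectorType = "sparse"

# Ordinary XGBoost parameters, forwarded verbatim into the param Map
num_round = 100
num_workers = 4
eta = 0.1
max_depth = 6
```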
ML Regression API
- Input/Output
- Package name: ml.dmlc.xgboost4j.scala.spark
- Class name: XGBoostRegressor
- Method name: fit
- Input: training sample data (Dataset[_]). The following fields are mandatory:

| Parameter | Type | Default Value | Description |
|---|---|---|---|
| labelCol | Double | label | Training label |
| featuresCol | Vector | features | Feature vector |
- Output: XGBoostRegressionModel, an XGBoost regression model. The output fields during model prediction are as follows:

| Parameter | Type | Example | Description |
|---|---|---|---|
| predictionCol | Double | prediction | Predicted label value |
- Algorithm parameters (setter methods):
def setAllowNonZeroForMissing(value: Boolean): XGBoostRegressor.this.type
def setAlpha(value: Double): XGBoostRegressor.this.type
def setBaseMarginCol(value: String): XGBoostRegressor.this.type
def setBaseScore(value: Double): XGBoostRegressor.this.type
def setCheckpointInterval(value: Int): XGBoostRegressor.this.type
def setCheckpointPath(value: String): XGBoostRegressor.this.type
def setColsampleBylevel(value: Double): XGBoostRegressor.this.type
def setColsampleBytree(value: Double): XGBoostRegressor.this.type
def setCustomEval(value: EvalTrait): XGBoostRegressor.this.type
def setCustomObj(value: ObjectiveTrait): XGBoostRegressor.this.type
def setEta(value: Double): XGBoostRegressor.this.type
def setEvalMetric(value: String): XGBoostRegressor.this.type
def setEvalSets(evalSets: Map[String, DataFrame]): XGBoostRegressor.this.type
def setFeaturesCol(value: String): XGBoostRegressor
def setGamma(value: Double): XGBoostRegressor.this.type
def setGroupCol(value: String): XGBoostRegressor.this.type
def setGrowPolicy(value: String): XGBoostRegressor.this.type
def setLabelCol(value: String): XGBoostRegressor.this.type
def setLambda(value: Double): XGBoostRegressor.this.type
def setLambdaBias(value: Double): XGBoostRegressor.this.type
def setMaxBins(value: Int): XGBoostRegressor.this.type
def setMaxDeltaStep(value: Double): XGBoostRegressor.this.type
def setMaxDepth(value: Int): XGBoostRegressor.this.type
def setMaxLeaves(value: Int): XGBoostRegressor.this.type
def setMaximizeEvaluationMetrics(value: Boolean): XGBoostRegressor.this.type
def setMinChildWeight(value: Double): XGBoostRegressor.this.type
def setMissing(value: Float): XGBoostRegressor.this.type
def setNormalizeType(value: String): XGBoostRegressor.this.type
def setNthread(value: Int): XGBoostRegressor.this.type
def setNumClass(value: Int): XGBoostRegressor.this.type
def setNumEarlyStoppingRounds(value: Int): XGBoostRegressor.this.type
def setNumRound(value: Int): XGBoostRegressor.this.type
def setNumWorkers(value: Int): XGBoostRegressor.this.type
def setObjective(value: String): XGBoostRegressor.this.type
def setObjectiveType(value: String): XGBoostRegressor.this.type
def setPredictionCol(value: String): XGBoostRegressor.this.type
def setRateDrop(value: Double): XGBoostRegressor.this.type
def setRawPredictionCol(value: String): XGBoostRegressor
def setSampleType(value: String): XGBoostRegressor.this.type
def setScalePosWeight(value: Double): XGBoostRegressor.this.type
def setSeed(value: Long): XGBoostRegressor.this.type
def setSilent(value: Int): XGBoostRegressor.this.type
def setSinglePrecisionHistogram(value: Boolean): XGBoostRegressor.this.type
def setSketchEps(value: Double): XGBoostRegressor.this.type
def setSkipDrop(value: Double): XGBoostRegressor.this.type
def setSubsample(value: Double): XGBoostRegressor.this.type
def setThresholds(value: Array[Double]): XGBoostRegressor
def setTimeoutRequestWorkers(value: Long): XGBoostRegressor.this.type
def setTrainTestRatio(value: Double): XGBoostRegressor.this.type
def setTreeMethod(value: String): XGBoostRegressor.this.type
def setUseExternalMemory(value: Boolean): XGBoostRegressor.this.type
def setWeightCol(value: String): XGBoostRegressor.this.type
- Added algorithm parameters

| Parameter | Description | Type |
|---|---|---|
| grow_policy | Adds the depthwiselossltd option, which controls how new nodes are added to the tree. This parameter takes effect only when tree_method is set to hist. | String. Default value: depthwise. Options: depthwise, lossguide, depthwiselossltd. |
| min_loss_ratio | Controls the degree of tree-node pruning during training. Valid only when grow_policy is set to depthwiselossltd. | Double. Default value: 0. Value range: [0, 1). |
| sampling_strategy | Controls the sampling strategy used during training. | String. Default value: eachTree. Options: eachTree, eachIteration, alliteration, multiIteration, gossStyle. |
| enable_bbgen | Whether to use the batch Bernoulli bit-generation algorithm. | Boolean. The value can be true or false. Default value: false. |
| sampling_step | Controls the number of sampling rounds. Valid only when sampling_strategy is set to multiIteration. | Int. Default value: 1. Value range: [1, +∞). |
| auto_subsample | Whether to use the policy of automatically reducing the sampling rate. | Boolean. The value can be true or false. Default value: false. |
| auto_k | Controls the number of rounds in the automatic sampling-rate reduction policy. Valid only when auto_subsample is set to true. | Int. Default value: 1. Value range: [1, +∞). |
| auto_subsample_ratio | Sets the ratios used when automatically reducing the sampling rate. | Array[Double]. Default value: Array(0.05, 0.1, 0.2, 0.4, 0.8, 1.0). Value range: (0, 1]. |
| auto_r | Controls the error-rate increase allowed when the sampling rate is automatically reduced. | Double. Default value: 0.95. Value range: (0, 1]. |
| rabit_enable_tcp_no_delay | Controls the communication policy of the Rabit engine. | Boolean. The value can be true or false. Default value: false. |
| random_split_denom | Controls the proportion of candidate split points. | Int. Default value: 1. Value range: [1, +∞). |
| default_direction | Controls the default direction for missing values. | String. Options: left, right, learn. Default value: learn. |
Code interface example:

```scala
val xgbRegression = new XGBoostRegressor(param)
  .setLabelCol("label")
  .setFeaturesCol("features")
val model = xgbRegression.fit(train_data)
val predictions = model.transform(test_data)
// RegressionEvaluator does not support "accuracy"; use a regression metric such as "rmse".
val evaluator = new RegressionEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("rmse")
val rmse = evaluator.evaluate(predictions)
```

- Example usage
```scala
package com.bigdata.ml

import java.io.File
import java.lang.System.nanoTime

import org.apache.spark.sql.{Dataset, Row, SparkSession}
import org.apache.spark.SparkConf
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.storage.StorageLevel
import scala.util.Random

import ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor
import com.typesafe.config.{Config, ConfigFactory}

object Xgboost_test {
  def profile[R](code: => R, t: Long = nanoTime) = (code, nanoTime - t)

  def getSparkSession(): SparkSession = {
    val conf = new SparkConf().setAppName("XGBOOST-SPARK")
    val spark = SparkSession.builder().config(conf).getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")
    println("SparkSession created successfully!")
    spark
  }

  def main(args: Array[String]): Unit = {
    val config = ConfigFactory.parseFile(new File(args(0)))
    // set seed
    Random.setSeed(System.currentTimeMillis())
    // set session
    val spark = this.getSparkSession()
    println("created spark session")
    println(spark.sparkContext.getConf.toDebugString)
    val (result, time) = profile(test(spark, config))
    val time_sec = time.asInstanceOf[Long].toDouble * 1.0e-9
    println(s"Profiling complete in $time_sec seconds.")
  }

  def test(spark: SparkSession, config: Config): Unit = {
    var param = Map[String, Any]()
    val it = config.entrySet.iterator
    while (it.hasNext) {
      val entry = it.next
      param += (entry.getKey -> entry.getValue.unwrapped)
    }
    if (!config.hasPath("allow_non_zero_for_missing")) {
      param += ("allow_non_zero_for_missing" -> true)
    }
    println(param.mkString(";\n"))

    val xgbRegression = new XGBoostRegressor(param)
      .setLabelCol("label")
      .setFeaturesCol("features")

    val time_point1 = System.currentTimeMillis()
    val train_data = getTrainData(spark, config).persist(StorageLevel.MEMORY_AND_DISK_SER)
    val time_point2 = System.currentTimeMillis()
    val model = xgbRegression.fit(train_data)
    val time_point3 = System.currentTimeMillis()
    val test_data = getTestData(spark, config).persist(StorageLevel.MEMORY_AND_DISK_SER)
    val predictions = model.transform(test_data)
    val time_point4 = System.currentTimeMillis()

    val load_time = (time_point2 - time_point1) / 1000.0
    println(s"Loading complete in $load_time seconds.")
    val training_time = (time_point3 - time_point2) / 1000.0
    println(s"Training complete in $training_time seconds.")
    val testing_time = (time_point4 - time_point3) / 1000.0
    println(s"Testing complete in $testing_time seconds.")

    // Select (prediction, true label) and compute test error (RMSE).
    val evaluator = new RegressionEvaluator()
      .setLabelCol("label")
      .setPredictionCol("prediction")
      .setMetricName("rmse")
    val rmse = evaluator.evaluate(predictions)
    println(s"Test Error = $rmse")

    // Select example rows to display.
    predictions.select("prediction", "label", "features").show(5)
  }

  def getTrainData(spark: SparkSession, config: Config): Dataset[Row] = {
    val tr_fname = config.getString("tr_fname")
    println("tr_fname", tr_fname)
    var reader = spark.read
      .format("libsvm")
      .option("vectorType",
        if (config.hasPath("vectorType")) config.getString("vectorType") else "sparse")
    if (config.hasPath("numFeatures")) {
      val numFeatures = config.getInt("numFeatures")
      println("numFeatures", numFeatures)
      reader = reader.option("numFeatures", numFeatures)
    }
    reader.load(tr_fname)
  }

  def getTestData(spark: SparkSession, config: Config): Dataset[Row] = {
    val ts_fname = config.getString("ts_fname")
    println("ts_fname", ts_fname)
    var reader = spark.read
      .format("libsvm")
      .option("vectorType",
        if (config.hasPath("vectorType")) config.getString("vectorType") else "sparse")
    if (config.hasPath("numFeatures")) {
      val numFeatures = config.getInt("numFeatures")
      println("numFeatures", numFeatures)
      reader = reader.option("numFeatures", numFeatures)
    }
    reader.load(ts_fname)
  }
}
```

- Example result
```
Test Error = 0.5872398843658918
+--------------------+-----+--------------------+
|          prediction|label|            features|
+--------------------+-----+--------------------+
|  0.2738455533981323|  0.0|(28,[0,1,2,3,4,5,...|
|0.052151769399642944|  0.0|(28,[0,1,2,3,4,5,...|
| 0.08468279242515564|  0.0|(28,[0,1,2,3,4,5,...|
| 0.20581847429275513|  0.0|(28,[0,1,2,3,4,5,...|
|  0.3741578459739685|  0.0|(28,[0,1,2,3,4,5,...|
+--------------------+-----+--------------------+
only showing top 5 rows
```