XGBoost
The XGBoost algorithm provides two model interfaces: the ML Classification API and the ML Regression API.
| Model API Type | Function API |
|---|---|
| ML Classification API | def fit(dataset: Dataset[_]): XGBoostClassificationModel |
| | def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[XGBoostClassificationModel] |
| | def fit(dataset: Dataset[_], paramMap: ParamMap): XGBoostClassificationModel |
| | def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): XGBoostClassificationModel |
| ML Regression API | def fit(dataset: Dataset[_]): XGBoostRegressionModel |
| | def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[XGBoostRegressionModel] |
| | def fit(dataset: Dataset[_], paramMap: ParamMap): XGBoostRegressionModel |
| | def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): XGBoostRegressionModel |
ML Classification API
- Input/Output
- Package name: ml.dmlc.xgboost4j.scala.spark
- Class name: XGBoostClassifier
- Method name: fit
- Input: training sample data (Dataset[_]). The following fields are mandatory:

| Parameter | Type | Default Value | Description |
|---|---|---|---|
| labelCol | Double | label | Training label |
| featuresCol | Vector | features | Feature vector |
- Output: XGBoostClassificationModel, an XGBoost classification model. The output fields during model prediction are as follows:

| Parameter | Type | Example | Description |
|---|---|---|---|
| predictionCol | Double | prediction | Predicted label value |
- Algorithm parameters (setter methods):
def setAllowNonZeroForMissing(value: Boolean): XGBoostClassifier.this.type
def setAlpha(value: Double): XGBoostClassifier.this.type
def setBaseMarginCol(value: String): XGBoostClassifier.this.type
def setBaseScore(value: Double): XGBoostClassifier.this.type
def setCheckpointInterval(value: Int): XGBoostClassifier.this.type
def setCheckpointPath(value: String): XGBoostClassifier.this.type
def setColsampleBylevel(value: Double): XGBoostClassifier.this.type
def setColsampleBytree(value: Double): XGBoostClassifier.this.type
def setCustomEval(value: EvalTrait): XGBoostClassifier.this.type
def setCustomObj(value: ObjectiveTrait): XGBoostClassifier.this.type
def setEta(value: Double): XGBoostClassifier.this.type
def setEvalMetric(value: String): XGBoostClassifier.this.type
def setEvalSets(evalSets: Map[String, DataFrame]): XGBoostClassifier.this.type
def setFeaturesCol(value: String): XGBoostClassifier
def setGamma(value: Double): XGBoostClassifier.this.type
def setGrowPolicy(value: String): XGBoostClassifier.this.type
def setLabelCol(value: String): XGBoostClassifier.this.type
def setLambda(value: Double): XGBoostClassifier.this.type
def setLambdaBias(value: Double): XGBoostClassifier.this.type
def setMaxBins(value: Int): XGBoostClassifier.this.type
def setMaxDeltaStep(value: Double): XGBoostClassifier.this.type
def setMaxDepth(value: Int): XGBoostClassifier.this.type
def setMaxLeaves(value: Int): XGBoostClassifier.this.type
def setMaximizeEvaluationMetrics(value: Boolean): XGBoostClassifier.this.type
def setMinChildWeight(value: Double): XGBoostClassifier.this.type
def setMissing(value: Float): XGBoostClassifier.this.type
def setNormalizeType(value: String): XGBoostClassifier.this.type
def setNthread(value: Int): XGBoostClassifier.this.type
def setNumClass(value: Int): XGBoostClassifier.this.type
def setNumEarlyStoppingRounds(value: Int): XGBoostClassifier.this.type
def setNumRound(value: Int): XGBoostClassifier.this.type
def setNumWorkers(value: Int): XGBoostClassifier.this.type
def setObjective(value: String): XGBoostClassifier.this.type
def setObjectiveType(value: String): XGBoostClassifier.this.type
def setPredictionCol(value: String): XGBoostClassifier
def setProbabilityCol(value: String): XGBoostClassifier
def setRateDrop(value: Double): XGBoostClassifier.this.type
def setRawPredictionCol(value: String): XGBoostClassifier.this.type
def setSampleType(value: String): XGBoostClassifier.this.type
def setScalePosWeight(value: Double): XGBoostClassifier.this.type
def setSeed(value: Long): XGBoostClassifier.this.type
def setSilent(value: Int): XGBoostClassifier.this.type
def setSinglePrecisionHistogram(value: Boolean): XGBoostClassifier.this.type
def setSketchEps(value: Double): XGBoostClassifier.this.type
def setSkipDrop(value: Double): XGBoostClassifier.this.type
def setSubsample(value: Double): XGBoostClassifier.this.type
def setThresholds(value: Array[Double]): XGBoostClassifier
def setTimeoutRequestWorkers(value: Long): XGBoostClassifier.this.type
def setTrainTestRatio(value: Double): XGBoostClassifier.this.type
def setTreeMethod(value: String): XGBoostClassifier.this.type
def setUseExternalMemory(value: Boolean): XGBoostClassifier.this.type
def setWeightCol(value: String): XGBoostClassifier.this.type
- Added algorithm parameters

| Parameter | Description | Type |
|---|---|---|
| grow_policy | Adds the depthwiselossltd option, which controls how new nodes are added to the tree. This parameter takes effect only when tree_method is set to hist. | String. Default value: depthwise. Options: depthwise, lossguide, depthwiselossltd. |
| min_loss_ratio | Controls the degree of tree-node pruning during training. Valid only when grow_policy is set to depthwiselossltd. | Double. Default value: 0. Value range: [0, 1). |
| sampling_strategy | Controls the sampling strategy used during training. | String. Default value: eachTree. Options: eachTree, eachIteration, alliteration, multiIteration, gossStyle. |
| enable_bbgen | Whether to use the batch Bernoulli bit-generation algorithm. | Boolean. The value can be true or false. Default value: false. |
| sampling_step | Controls the number of sampling rounds. Valid only when sampling_strategy is set to multiIteration. | Int. Default value: 1. Value range: [1, +∞). |
| auto_subsample | Whether to use the policy of automatically reducing the sampling rate. | Boolean. The value can be true or false. Default value: false. |
| auto_k | Controls the number of rounds in the automatic sampling-rate reduction policy. Valid only when auto_subsample is set to true. | Int. Default value: 1. Value range: [1, +∞). |
| auto_subsample_ratio | Sets the ratios used when automatically reducing the sampling rate. | Array[Double]. Default value: Array(0.05, 0.1, 0.2, 0.4, 0.8, 1.0). Value range: (0, 1]. |
| auto_r | Controls the error-rate increase allowed when the sampling rate is automatically reduced. | Double. Default value: 0.95. Value range: (0, 1]. |
| rabit_enable_tcp_no_delay | Controls the communication policy of the Rabit engine. | Boolean. The value can be true or false. Default value: false. |
| random_split_denom | Controls the proportion of candidate split points. | Int. Default value: 1. Value range: [1, +∞). |
| default_direction | Controls the default direction for missing values. | String. Options: left, right, learn. Default value: learn. |
Code interface example:

```scala
val xgbClassifier = new XGBoostClassifier(param)
  .setLabelCol("label")
  .setFeaturesCol("features")
val model = xgbClassifier.fit(train_data)
val predictions = model.transform(test_data)
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
```
- Example usage
```scala
package com.bigdata.ml

import java.io.File
import java.lang.System.nanoTime

import org.apache.spark.sql.{Dataset, Row, SparkSession}
import org.apache.spark.SparkConf
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.storage.StorageLevel
import scala.util.Random

import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier
import com.typesafe.config.{Config, ConfigFactory}

object Xgboost_test {
  def profile[R](code: => R, t: Long = nanoTime) = (code, nanoTime - t)

  def getSparkSession(): SparkSession = {
    val conf = new SparkConf().setAppName("XGBOOST-SPARK")
    val spark = SparkSession.builder().config(conf).getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")
    println("SparkSession created successfully!")
    spark
  }

  def main(args: Array[String]): Unit = {
    val config = ConfigFactory.parseFile(new File(args(0)))
    // set seed
    Random.setSeed(System.currentTimeMillis())
    // set session
    val spark = this.getSparkSession()
    println("created spark session")
    println(spark.sparkContext.getConf.toDebugString)
    val (result, time) = profile(test(spark, config))
    val time_sec = time.asInstanceOf[Long].toDouble * 1.0e-9
    println(s"Profiling complete in $time_sec seconds.")
  }

  def test(spark: SparkSession, config: Config): Unit = {
    var param = Map[String, Any]()
    val it = config.entrySet.iterator
    while (it.hasNext) {
      val entry = it.next
      param += (entry.getKey -> entry.getValue.unwrapped)
    }
    if (!config.hasPath("allow_non_zero_for_missing")) {
      param += ("allow_non_zero_for_missing" -> true)
    }
    println(param.mkString(";\n"))

    val xgbClassifier = new XGBoostClassifier(param)
      .setLabelCol("label")
      .setFeaturesCol("features")

    val time_point1 = System.currentTimeMillis()
    val train_data = getTrainData(spark, config).persist(StorageLevel.MEMORY_AND_DISK_SER)
    val time_point2 = System.currentTimeMillis()
    val model = xgbClassifier.fit(train_data)
    val time_point3 = System.currentTimeMillis()
    val test_data = getTestData(spark, config).persist(StorageLevel.MEMORY_AND_DISK_SER)
    val predictions = model.transform(test_data)
    val time_point4 = System.currentTimeMillis()

    val load_time = (time_point2 - time_point1) / 1000.0
    println(s"Loading complete in $load_time seconds.")
    val training_time = (time_point3 - time_point2) / 1000.0
    println(s"Training complete in $training_time seconds.")
    val testing_time = (time_point4 - time_point3) / 1000.0
    println(s"Testing complete in $testing_time seconds.")

    // Select (prediction, true label) and compute test error.
    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("label")
      .setPredictionCol("prediction")
      .setMetricName("accuracy")
    val accuracy = evaluator.evaluate(predictions)
    println(s"Test Error = ${1.0 - accuracy}")

    // Select example rows to display.
    predictions.select("prediction", "label", "features").show(5)
  }

  def getTrainData(spark: SparkSession, config: Config): Dataset[Row] = {
    val tr_fname = config.getString("tr_fname")
    println("tr_fname", tr_fname)
    var reader = spark.read
      .format("libsvm")
      .option("vectorType",
        if (config.hasPath("vectorType")) config.getString("vectorType") else "sparse")
    if (config.hasPath("numFeatures")) {
      val numFeatures = config.getInt("numFeatures")
      println("numFeatures", numFeatures)
      reader = reader.option("numFeatures", numFeatures)
    }
    reader.load(tr_fname)
  }

  def getTestData(spark: SparkSession, config: Config): Dataset[Row] = {
    val ts_fname = config.getString("ts_fname")
    println("ts_fname", ts_fname)
    var reader = spark.read
      .format("libsvm")
      .option("vectorType",
        if (config.hasPath("vectorType")) config.getString("vectorType") else "sparse")
    if (config.hasPath("numFeatures")) {
      val numFeatures = config.getInt("numFeatures")
      println("numFeatures", numFeatures)
      reader = reader.option("numFeatures", numFeatures)
    }
    reader.load(ts_fname)
  }
}
```

- Example result
```
Test Error = 0.253418287207109
+----------+-----+--------------------+
|prediction|label|            features|
+----------+-----+--------------------+
|       0.0|  0.0|(28,[0,1,2,3,4,5,...|
|       0.0|  0.0|(28,[0,1,2,3,4,5,...|
|       0.0|  0.0|(28,[0,1,2,3,4,5,...|
|       0.0|  0.0|(28,[0,1,2,3,4,5,...|
|       0.0|  0.0|(28,[0,1,2,3,4,5,...|
+----------+-----+--------------------+
only showing top 5 rows
```
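The driver in the example reads all of its settings from a Typesafe Config (HOCON) file passed as `args(0)`, copying every entry into the XGBoost parameter Map. Only `tr_fname`, `ts_fname`, `vectorType`, and `numFeatures` are read explicitly by the example code; the file paths and the remaining keys below (`num_round`, `num_workers`, `eta`, `max_depth`) are illustrative values we have chosen, not required settings:

```hocon
# Paths read by getTrainData/getTestData (hypothetical locations)
tr_fname = "/data/sample_train.libsvm"
ts_fname = "/data/sample_test.libsvm"

# Optional libsvm reader hints
numFeatures = 28
vectorType = "sparse"

# Ordinary XGBoost parameters, forwarded verbatim into the param Map
num_round = 100
num_workers = 4
eta = 0.1
max_depth = 6
```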
ML Regression API
- Input/Output
- Package name: ml.dmlc.xgboost4j.scala.spark
- Class name: XGBoostRegressor
- Method name: fit
- Input: training sample data (Dataset[_]). The following fields are mandatory:

| Parameter | Type | Default Value | Description |
|---|---|---|---|
| labelCol | Double | label | Training label |
| featuresCol | Vector | features | Feature vector |
- Output: XGBoostRegressionModel, an XGBoost regression model. The output fields during model prediction are as follows:

| Parameter | Type | Example | Description |
|---|---|---|---|
| predictionCol | Double | prediction | Predicted label value |
- Algorithm parameters (setter methods):
def setAllowNonZeroForMissing(value: Boolean): XGBoostRegressor.this.type
def setAlpha(value: Double): XGBoostRegressor.this.type
def setBaseMarginCol(value: String): XGBoostRegressor.this.type
def setBaseScore(value: Double): XGBoostRegressor.this.type
def setCheckpointInterval(value: Int): XGBoostRegressor.this.type
def setCheckpointPath(value: String): XGBoostRegressor.this.type
def setColsampleBylevel(value: Double): XGBoostRegressor.this.type
def setColsampleBytree(value: Double): XGBoostRegressor.this.type
def setCustomEval(value: EvalTrait): XGBoostRegressor.this.type
def setCustomObj(value: ObjectiveTrait): XGBoostRegressor.this.type
def setEta(value: Double): XGBoostRegressor.this.type
def setEvalMetric(value: String): XGBoostRegressor.this.type
def setEvalSets(evalSets: Map[String, DataFrame]): XGBoostRegressor.this.type
def setFeaturesCol(value: String): XGBoostRegressor
def setGamma(value: Double): XGBoostRegressor.this.type
def setGroupCol(value: String): XGBoostRegressor.this.type
def setGrowPolicy(value: String): XGBoostRegressor.this.type
def setLabelCol(value: String): XGBoostRegressor.this.type
def setLambda(value: Double): XGBoostRegressor.this.type
def setLambdaBias(value: Double): XGBoostRegressor.this.type
def setMaxBins(value: Int): XGBoostRegressor.this.type
def setMaxDeltaStep(value: Double): XGBoostRegressor.this.type
def setMaxDepth(value: Int): XGBoostRegressor.this.type
def setMaxLeaves(value: Int): XGBoostRegressor.this.type
def setMaximizeEvaluationMetrics(value: Boolean): XGBoostRegressor.this.type
def setMinChildWeight(value: Double): XGBoostRegressor.this.type
def setMissing(value: Float): XGBoostRegressor.this.type
def setNormalizeType(value: String): XGBoostRegressor.this.type
def setNthread(value: Int): XGBoostRegressor.this.type
def setNumClass(value: Int): XGBoostRegressor.this.type
def setNumEarlyStoppingRounds(value: Int): XGBoostRegressor.this.type
def setNumRound(value: Int): XGBoostRegressor.this.type
def setNumWorkers(value: Int): XGBoostRegressor.this.type
def setObjective(value: String): XGBoostRegressor.this.type
def setObjectiveType(value: String): XGBoostRegressor.this.type
def setPredictionCol(value: String): XGBoostRegressor.this.type
def setRateDrop(value: Double): XGBoostRegressor.this.type
def setRawPredictionCol(value: String): XGBoostRegressor
def setSampleType(value: String): XGBoostRegressor.this.type
def setScalePosWeight(value: Double): XGBoostRegressor.this.type
def setSeed(value: Long): XGBoostRegressor.this.type
def setSilent(value: Int): XGBoostRegressor.this.type
def setSinglePrecisionHistogram(value: Boolean): XGBoostRegressor.this.type
def setSketchEps(value: Double): XGBoostRegressor.this.type
def setSkipDrop(value: Double): XGBoostRegressor.this.type
def setSubsample(value: Double): XGBoostRegressor.this.type
def setThresholds(value: Array[Double]): XGBoostRegressor
def setTimeoutRequestWorkers(value: Long): XGBoostRegressor.this.type
def setTrainTestRatio(value: Double): XGBoostRegressor.this.type
def setTreeMethod(value: String): XGBoostRegressor.this.type
def setUseExternalMemory(value: Boolean): XGBoostRegressor.this.type
def setWeightCol(value: String): XGBoostRegressor.this.type
- Added algorithm parameters

| Parameter | Description | Type |
|---|---|---|
| grow_policy | Adds the depthwiselossltd option, which controls how new nodes are added to the tree. This parameter takes effect only when tree_method is set to hist. | String. Default value: depthwise. Options: depthwise, lossguide, depthwiselossltd. |
| min_loss_ratio | Controls the degree of tree-node pruning during training. Valid only when grow_policy is set to depthwiselossltd. | Double. Default value: 0. Value range: [0, 1). |
| sampling_strategy | Controls the sampling strategy used during training. | String. Default value: eachTree. Options: eachTree, eachIteration, alliteration, multiIteration, gossStyle. |
| enable_bbgen | Whether to use the batch Bernoulli bit-generation algorithm. | Boolean. The value can be true or false. Default value: false. |
| sampling_step | Controls the number of sampling rounds. Valid only when sampling_strategy is set to multiIteration. | Int. Default value: 1. Value range: [1, +∞). |
| auto_subsample | Whether to use the policy of automatically reducing the sampling rate. | Boolean. The value can be true or false. Default value: false. |
| auto_k | Controls the number of rounds in the automatic sampling-rate reduction policy. Valid only when auto_subsample is set to true. | Int. Default value: 1. Value range: [1, +∞). |
| auto_subsample_ratio | Sets the ratios used when automatically reducing the sampling rate. | Array[Double]. Default value: Array(0.05, 0.1, 0.2, 0.4, 0.8, 1.0). Value range: (0, 1]. |
| auto_r | Controls the error-rate increase allowed when the sampling rate is automatically reduced. | Double. Default value: 0.95. Value range: (0, 1]. |
| rabit_enable_tcp_no_delay | Controls the communication policy of the Rabit engine. | Boolean. The value can be true or false. Default value: false. |
| random_split_denom | Controls the proportion of candidate split points. | Int. Default value: 1. Value range: [1, +∞). |
| default_direction | Controls the default direction for missing values. | String. Options: left, right, learn. Default value: learn. |
Code interface example:

```scala
val xgbRegression = new XGBoostRegressor(param)
  .setLabelCol("label")
  .setFeaturesCol("features")
val model = xgbRegression.fit(train_data)
val predictions = model.transform(test_data)
// RegressionEvaluator does not support "accuracy"; use a regression metric such as "rmse".
val evaluator = new RegressionEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("rmse")
val rmse = evaluator.evaluate(predictions)
```

- Example usage
```scala
package com.bigdata.ml

import java.io.File
import java.lang.System.nanoTime

import org.apache.spark.sql.{Dataset, Row, SparkSession}
import org.apache.spark.SparkConf
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.storage.StorageLevel
import scala.util.Random

import ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor
import com.typesafe.config.{Config, ConfigFactory}

object Xgboost_test {
  def profile[R](code: => R, t: Long = nanoTime) = (code, nanoTime - t)

  def getSparkSession(): SparkSession = {
    val conf = new SparkConf().setAppName("XGBOOST-SPARK")
    val spark = SparkSession.builder().config(conf).getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")
    println("SparkSession created successfully!")
    spark
  }

  def main(args: Array[String]): Unit = {
    val config = ConfigFactory.parseFile(new File(args(0)))
    // set seed
    Random.setSeed(System.currentTimeMillis())
    // set session
    val spark = this.getSparkSession()
    println("created spark session")
    println(spark.sparkContext.getConf.toDebugString)
    val (result, time) = profile(test(spark, config))
    val time_sec = time.asInstanceOf[Long].toDouble * 1.0e-9
    println(s"Profiling complete in $time_sec seconds.")
  }

  def test(spark: SparkSession, config: Config): Unit = {
    var param = Map[String, Any]()
    val it = config.entrySet.iterator
    while (it.hasNext) {
      val entry = it.next
      param += (entry.getKey -> entry.getValue.unwrapped)
    }
    if (!config.hasPath("allow_non_zero_for_missing")) {
      param += ("allow_non_zero_for_missing" -> true)
    }
    println(param.mkString(";\n"))

    val xgbRegression = new XGBoostRegressor(param)
      .setLabelCol("label")
      .setFeaturesCol("features")

    val time_point1 = System.currentTimeMillis()
    val train_data = getTrainData(spark, config).persist(StorageLevel.MEMORY_AND_DISK_SER)
    val time_point2 = System.currentTimeMillis()
    val model = xgbRegression.fit(train_data)
    val time_point3 = System.currentTimeMillis()
    val test_data = getTestData(spark, config).persist(StorageLevel.MEMORY_AND_DISK_SER)
    val predictions = model.transform(test_data)
    val time_point4 = System.currentTimeMillis()

    val load_time = (time_point2 - time_point1) / 1000.0
    println(s"Loading complete in $load_time seconds.")
    val training_time = (time_point3 - time_point2) / 1000.0
    println(s"Training complete in $training_time seconds.")
    val testing_time = (time_point4 - time_point3) / 1000.0
    println(s"Testing complete in $testing_time seconds.")

    // Select (prediction, true label) and compute test error (RMSE).
    val evaluator = new RegressionEvaluator()
      .setLabelCol("label")
      .setPredictionCol("prediction")
      .setMetricName("rmse")
    val rmse = evaluator.evaluate(predictions)
    println(s"Test Error = $rmse")

    // Select example rows to display.
    predictions.select("prediction", "label", "features").show(5)
  }

  def getTrainData(spark: SparkSession, config: Config): Dataset[Row] = {
    val tr_fname = config.getString("tr_fname")
    println("tr_fname", tr_fname)
    var reader = spark.read
      .format("libsvm")
      .option("vectorType",
        if (config.hasPath("vectorType")) config.getString("vectorType") else "sparse")
    if (config.hasPath("numFeatures")) {
      val numFeatures = config.getInt("numFeatures")
      println("numFeatures", numFeatures)
      reader = reader.option("numFeatures", numFeatures)
    }
    reader.load(tr_fname)
  }

  def getTestData(spark: SparkSession, config: Config): Dataset[Row] = {
    val ts_fname = config.getString("ts_fname")
    println("ts_fname", ts_fname)
    var reader = spark.read
      .format("libsvm")
      .option("vectorType",
        if (config.hasPath("vectorType")) config.getString("vectorType") else "sparse")
    if (config.hasPath("numFeatures")) {
      val numFeatures = config.getInt("numFeatures")
      println("numFeatures", numFeatures)
      reader = reader.option("numFeatures", numFeatures)
    }
    reader.load(ts_fname)
  }
}
```

- Example result
```
Test Error = 0.5872398843658918
+--------------------+-----+--------------------+
|          prediction|label|            features|
+--------------------+-----+--------------------+
|  0.2738455533981323|  0.0|(28,[0,1,2,3,4,5,...|
|0.052151769399642944|  0.0|(28,[0,1,2,3,4,5,...|
| 0.08468279242515564|  0.0|(28,[0,1,2,3,4,5,...|
| 0.20581847429275513|  0.0|(28,[0,1,2,3,4,5,...|
|  0.3741578459739685|  0.0|(28,[0,1,2,3,4,5,...|
+--------------------+-----+--------------------+
only showing top 5 rows
```