
XGBoost

The XGBoost algorithm provides two model interfaces: the ML Classification API and the ML Regression API.

ML Classification API

  def fit(dataset: Dataset[_]): XGBoostClassificationModel
  def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[XGBoostClassificationModel]
  def fit(dataset: Dataset[_], paramMap: ParamMap): XGBoostClassificationModel
  def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): XGBoostClassificationModel

ML Regression API

  def fit(dataset: Dataset[_]): XGBoostRegressionModel
  def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[XGBoostRegressionModel]
  def fit(dataset: Dataset[_], paramMap: ParamMap): XGBoostRegressionModel
  def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): XGBoostRegressionModel
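The four fit overloads differ only in how training parameters are supplied. A minimal sketch of each, assuming an existing SparkSession and a training Dataset named trainData with "label" and "features" columns (all parameter values are placeholders):

```scala
import org.apache.spark.ml.param.ParamMap
import ml.dmlc.xgboost4j.scala.spark.{XGBoostClassifier, XGBoostClassificationModel}

val clf = new XGBoostClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")

// 1. Train with the estimator's current parameters.
val model: XGBoostClassificationModel = clf.fit(trainData)

// 2. Train once per ParamMap; returns one model per map.
val grid: Array[ParamMap] = Array(
  ParamMap(clf.maxDepth -> 4, clf.eta -> 0.3),
  ParamMap(clf.maxDepth -> 8, clf.eta -> 0.1))
val models: Seq[XGBoostClassificationModel] = clf.fit(trainData, grid)

// 3. Train with a single ParamMap of overrides.
val m3 = clf.fit(trainData, ParamMap(clf.numRound -> 50))

// 4. Train with explicit ParamPair overrides.
val m4 = clf.fit(trainData, clf.maxDepth -> 6, clf.subsample -> 0.8)
```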

ML Classification API

  • Function

    Import sample data in dataset format, call the training API, and output the XGBoost classification model.

  • Input/Output
    1. Package name: ml.dmlc.xgboost4j.scala.spark
    2. Class name: XGBoostClassifier
    3. Method name: fit
    4. Input: training sample data (Dataset[_]). The following are mandatory fields:

      Parameter    Type    Default Value  Description
      labelCol     Double  label          Label column of the training samples
      featuresCol  Vector  features       Feature vector column

    5. Output: XGBoostClassificationModel, which is an XGBoost classification model. The output fields during model prediction are as follows:

      Parameter      Type    Example     Description
      predictionCol  Double  prediction  Predicted label value

    6. Algorithm parameters

      Algorithm Parameters

      def setAllowNonZeroForMissing(value: Boolean): XGBoostClassifier.this.type

      def setAlpha(value: Double): XGBoostClassifier.this.type

      def setBaseMarginCol(value: String): XGBoostClassifier.this.type

      def setBaseScore(value: Double): XGBoostClassifier.this.type

      def setCheckpointInterval(value: Int): XGBoostClassifier.this.type

      def setCheckpointPath(value: String): XGBoostClassifier.this.type

      def setColsampleBylevel(value: Double): XGBoostClassifier.this.type

      def setColsampleBytree(value: Double): XGBoostClassifier.this.type

      def setCustomEval(value: EvalTrait): XGBoostClassifier.this.type

      def setCustomObj(value: ObjectiveTrait): XGBoostClassifier.this.type

      def setEta(value: Double): XGBoostClassifier.this.type

      def setEvalMetric(value: String): XGBoostClassifier.this.type

      def setEvalSets(evalSets: Map[String, DataFrame]): XGBoostClassifier.this.type

      def setFeaturesCol(value: String): XGBoostClassifier

      def setGamma(value: Double): XGBoostClassifier.this.type

      def setGrowPolicy(value: String): XGBoostClassifier.this.type

      def setLabelCol(value: String): XGBoostClassifier.this.type

      def setLambda(value: Double): XGBoostClassifier.this.type

      def setLambdaBias(value: Double): XGBoostClassifier.this.type

      def setMaxBins(value: Int): XGBoostClassifier.this.type

      def setMaxDeltaStep(value: Double): XGBoostClassifier.this.type

      def setMaxDepth(value: Int): XGBoostClassifier.this.type

      def setMaxLeaves(value: Int): XGBoostClassifier.this.type

      def setMaximizeEvaluationMetrics(value: Boolean): XGBoostClassifier.this.type

      def setMinChildWeight(value: Double): XGBoostClassifier.this.type

      def setMissing(value: Float): XGBoostClassifier.this.type

      def setNormalizeType(value: String): XGBoostClassifier.this.type

      def setNthread(value: Int): XGBoostClassifier.this.type

      def setNumClass(value: Int): XGBoostClassifier.this.type

      def setNumEarlyStoppingRounds(value: Int): XGBoostClassifier.this.type

      def setNumRound(value: Int): XGBoostClassifier.this.type

      def setNumWorkers(value: Int): XGBoostClassifier.this.type

      def setObjective(value: String): XGBoostClassifier.this.type

      def setObjectiveType(value: String): XGBoostClassifier.this.type

      def setPredictionCol(value: String): XGBoostClassifier

      def setProbabilityCol(value: String): XGBoostClassifier

      def setRateDrop(value: Double): XGBoostClassifier.this.type

      def setRawPredictionCol(value: String): XGBoostClassifier.this.type

      def setSampleType(value: String): XGBoostClassifier.this.type

      def setScalePosWeight(value: Double): XGBoostClassifier.this.type

      def setSeed(value: Long): XGBoostClassifier.this.type

      def setSilent(value: Int): XGBoostClassifier.this.type

      def setSinglePrecisionHistogram(value: Boolean): XGBoostClassifier.this.type

      def setSketchEps(value: Double): XGBoostClassifier.this.type

      def setSkipDrop(value: Double): XGBoostClassifier.this.type

      def setSubsample(value: Double): XGBoostClassifier.this.type

      def setThresholds(value: Array[Double]): XGBoostClassifier

      def setTimeoutRequestWorkers(value: Long): XGBoostClassifier.this.type

      def setTrainTestRatio(value: Double): XGBoostClassifier.this.type

      def setTreeMethod(value: String): XGBoostClassifier.this.type

      def setUseExternalMemory(value: Boolean): XGBoostClassifier.this.type

      def setWeightCol(value: String): XGBoostClassifier.this.type

    7. Added algorithm parameters

      grow_policy
        Description: Adds the depthwiselossltd option, which controls how new nodes are added to the tree. Takes effect only when tree_method is set to hist.
        Type: String. Default value: depthwise. Options: depthwise, lossguide, depthwiselossltd.

      min_loss_ratio
        Description: Controls the degree to which tree nodes are pruned during training. Valid only when grow_policy is set to depthwiselossltd.
        Type: Double. Default value: 0. Value range: [0, 1).

      sampling_strategy
        Description: Controls the sampling strategy used during training.
        Type: String. Default value: eachTree. Options: eachTree, eachIteration, allIteration, multiIteration, gossStyle.

      enable_bbgen
        Description: Whether to use the batch Bernoulli bit generation algorithm.
        Type: Boolean. Default value: false. Options: true, false.

      sampling_step
        Description: Controls the number of sampling rounds. Valid only when sampling_strategy is set to multiIteration.
        Type: Int. Default value: 1. Value range: [1, +∞).

      auto_subsample
        Description: Whether to use the policy that automatically reduces the sampling rate.
        Type: Boolean. Default value: false. Options: true, false.

      auto_k
        Description: Controls the number of rounds in the automatic sampling-rate reduction policy. Valid only when auto_subsample is set to true.
        Type: Int. Default value: 1. Value range: [1, +∞).

      auto_subsample_ratio
        Description: Sets the ratios used when the sampling rate is reduced automatically.
        Type: Array[Double]. Default value: Array(0.05, 0.1, 0.2, 0.4, 0.8, 1.0). Value range of each element: (0, 1].

      auto_r
        Description: Controls the error-rate increase allowed when the sampling rate is reduced automatically.
        Type: Double. Default value: 0.95. Value range: (0, 1].

      rabit_enable_tcp_no_delay
        Description: Controls the communication policy of the Rabit engine.
        Type: Boolean. Default value: false. Options: true, false.

      random_split_denom
        Description: Controls the proportion of candidate split points.
        Type: Int. Default value: 1. Value range: [1, +∞).

      default_direction
        Description: Controls the default split direction for missing (default) values.
        Type: String. Default value: learn. Options: left, right, learn.

      Code interface example:

      val xgbClassifier = new XGBoostClassifier(param).setLabelCol("label").setFeaturesCol("features")
      val model = xgbClassifier.fit(train_data)
      val predictions = model.transform(test_data)
      val evaluator = new MulticlassClassificationEvaluator().setLabelCol("label").setPredictionCol("prediction").setMetricName("accuracy")
      val accuracy = evaluator.evaluate(predictions)
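      In the example usage below, parameters are read from a Typesafe Config (HOCON) file passed as args(0), and every key is forwarded verbatim to the XGBoostClassifier parameter map. A hypothetical configuration file exercising some of the added parameters (paths and values are placeholders):

      ```
      # Hypothetical HOCON configuration parsed by ConfigFactory.parseFile;
      # each key is forwarded verbatim to the XGBoostClassifier parameter map.
      tr_fname = "/data/train.libsvm"
      ts_fname = "/data/test.libsvm"
      num_round = 100
      num_workers = 4
      tree_method = "hist"
      grow_policy = "depthwiselossltd"
      min_loss_ratio = 0.1
      sampling_strategy = "multiIteration"
      sampling_step = 2
      allow_non_zero_for_missing = true
      ```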
      
      • Example usage
        package com.bigdata.ml
        import java.io.File
        import java.lang.System.nanoTime
        import org.apache.spark.sql.{Dataset, Row, SparkSession}
        import org.apache.spark.SparkConf
        import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
        import org.apache.spark.storage.StorageLevel
        import scala.util.Random
        import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier
        import com.typesafe.config.{Config, ConfigFactory}
        
        object Xgboost_test {
          def profile[R](code: => R, t: Long = nanoTime) = (code, nanoTime - t)
          def getSparkSession(): SparkSession = {
            val conf = new SparkConf()
              .setAppName("XGBOOST-SPARK")
            val spark =
                SparkSession
                  .builder()
                  .config( conf )
                  .getOrCreate()
            spark.sparkContext.setLogLevel("ERROR")
            println("SparkSession created successfully!")
            spark;
          }
        
          def main(args: Array[String]): Unit = {
            val config = ConfigFactory.parseFile(new File(args(0)))
            // set seed
            Random.setSeed(System.currentTimeMillis())
            // set session
            val spark = this.getSparkSession()
            println("created spark session")
            println(spark.sparkContext.getConf.toDebugString)
            val (result, time) = profile(test(spark, config))
            val time_sec = time.asInstanceOf[Long].toDouble * 1.0e-9
            println(s"Profiling complete in $time_sec seconds. ")
          }
        
          def test(spark: SparkSession,
                   config: Config): Unit = {
            var param = Map[String, Any]()
            val it = config.entrySet.iterator
            while (it.hasNext) {
              val entry = it.next
              param += (entry.getKey -> entry.getValue.unwrapped)
            }
            if (!config.hasPath("allow_non_zero_for_missing")) {
              param += ("allow_non_zero_for_missing" -> true)
            }
            println(param.mkString(";\n"))
            val xgbClassifier = new XGBoostClassifier(param)
              .setLabelCol("label")
              .setFeaturesCol("features")
            val time_point1 = System.currentTimeMillis()
            val train_data = getTrainData(spark, config).persist(StorageLevel.MEMORY_AND_DISK_SER)
            val time_point2 = System.currentTimeMillis()
            val model = xgbClassifier.fit(train_data)
            val time_point3 = System.currentTimeMillis()
            val test_data = getTestData(spark, config).persist(StorageLevel.MEMORY_AND_DISK_SER)
            val predictions = model.transform(test_data)
            val time_point4 = System.currentTimeMillis()
            val load_time = (time_point2 - time_point1) / 1000.0
            println(s"Loading complete in $load_time seconds.")
            val training_time = (time_point3 - time_point2) / 1000.0
            println(s"Training complete in $training_time seconds.")
            val testing_time = (time_point4 - time_point3) / 1000.0
            println(s"Testing complete in $testing_time seconds.")
        
            // Select (prediction, true label) and compute test error.
            val evaluator = new MulticlassClassificationEvaluator()
              .setLabelCol("label")
              .setPredictionCol("prediction")
              .setMetricName("accuracy")
            val accuracy = evaluator.evaluate(predictions)
            println(s"Test Error = ${(1.0 - accuracy)}")
        
            // Select example rows to display.
            predictions.select("prediction", "label", "features").show(5)
          }
          def getTrainData(spark: SparkSession, config: Config): Dataset[Row] = {
            val tr_fname = config.getString("tr_fname")
            println("tr_fname", tr_fname)
            var reader =  spark
              .read
              .format("libsvm")
              .option("vectorType", if (config.hasPath("vectorType")) config.getString("vectorType") else "sparse")
            if(config.hasPath("numFeatures")) {
              val numFeatures = config.getInt("numFeatures")
              println("numFeatures", numFeatures)
              reader = reader.option("numFeatures", numFeatures)
            }
            val tr_data = reader
              .load(tr_fname)
            tr_data
          }
        
          def getTestData(spark: SparkSession, config: Config): Dataset[Row] = {
            val ts_fname = config.getString("ts_fname")
            println("ts_fname", ts_fname)
            var reader =  spark
              .read
              .format("libsvm")
              .option("vectorType", if (config.hasPath("vectorType")) config.getString("vectorType") else "sparse")
            if(config.hasPath("numFeatures")) {
              val numFeatures = config.getInt("numFeatures")
              println("numFeatures", numFeatures)
              reader = reader.option("numFeatures", numFeatures)
            }
        
            val ts_data = reader
              .load(ts_fname)
            ts_data
          }
        }
      • Example result
        Test Error =  0.253418287207109
        +----------+-----+--------------------+
        |prediction|label|            features|
        +----------+-----+--------------------+
        |       0.0|  0.0|(28,[0,1,2,3,4,5,...|
        |       0.0|  0.0|(28,[0,1,2,3,4,5,...|
        |       0.0|  0.0|(28,[0,1,2,3,4,5,...|
        |       0.0|  0.0|(28,[0,1,2,3,4,5,...|
        |       0.0|  0.0|(28,[0,1,2,3,4,5,...|
        +----------+-----+--------------------+
        only showing top 5 rows

ML Regression API

  • Function

    Import sample data in dataset format, call the training API, and output the XGBoost regression model.

  • Input/Output
    1. Package name: ml.dmlc.xgboost4j.scala.spark
    2. Class name: XGBoostRegressor
    3. Method name: fit
    4. Input: training sample data (Dataset[_]). The following are mandatory fields:

      Parameter    Type    Default Value  Description
      labelCol     Double  label          Label column of the training samples
      featuresCol  Vector  features       Feature vector column

    5. Output: XGBoostRegressionModel, which is an XGBoost regression model. The output fields during model prediction are as follows:

      Parameter      Type    Example     Description
      predictionCol  Double  prediction  Predicted label value

    6. Algorithm parameters

      Algorithm Parameters

      def setAllowNonZeroForMissing(value: Boolean): XGBoostRegressor.this.type

      def setAlpha(value: Double): XGBoostRegressor.this.type

      def setBaseMarginCol(value: String): XGBoostRegressor.this.type

      def setBaseScore(value: Double): XGBoostRegressor.this.type

      def setCheckpointInterval(value: Int): XGBoostRegressor.this.type

      def setCheckpointPath(value: String): XGBoostRegressor.this.type

      def setColsampleBylevel(value: Double): XGBoostRegressor.this.type

      def setColsampleBytree(value: Double): XGBoostRegressor.this.type

      def setCustomEval(value: EvalTrait): XGBoostRegressor.this.type

      def setCustomObj(value: ObjectiveTrait): XGBoostRegressor.this.type

      def setEta(value: Double): XGBoostRegressor.this.type

      def setEvalMetric(value: String): XGBoostRegressor.this.type

      def setEvalSets(evalSets: Map[String, DataFrame]): XGBoostRegressor.this.type

      def setFeaturesCol(value: String): XGBoostRegressor

      def setGamma(value: Double): XGBoostRegressor.this.type

      def setGroupCol(value: String): XGBoostRegressor.this.type

      def setGrowPolicy(value: String): XGBoostRegressor.this.type

      def setLabelCol(value: String): XGBoostRegressor.this.type

      def setLambda(value: Double): XGBoostRegressor.this.type

      def setLambdaBias(value: Double): XGBoostRegressor.this.type

      def setMaxBins(value: Int): XGBoostRegressor.this.type

      def setMaxDeltaStep(value: Double): XGBoostRegressor.this.type

      def setMaxDepth(value: Int): XGBoostRegressor.this.type

      def setMaxLeaves(value: Int): XGBoostRegressor.this.type

      def setMaximizeEvaluationMetrics(value: Boolean): XGBoostRegressor.this.type

      def setMinChildWeight(value: Double): XGBoostRegressor.this.type

      def setMissing(value: Float): XGBoostRegressor.this.type

      def setNormalizeType(value: String): XGBoostRegressor.this.type

      def setNthread(value: Int): XGBoostRegressor.this.type

      def setNumClass(value: Int): XGBoostRegressor.this.type

      def setNumEarlyStoppingRounds(value: Int): XGBoostRegressor.this.type

      def setNumRound(value: Int): XGBoostRegressor.this.type

      def setNumWorkers(value: Int): XGBoostRegressor.this.type

      def setObjective(value: String): XGBoostRegressor.this.type

      def setObjectiveType(value: String): XGBoostRegressor.this.type

      def setPredictionCol(value: String): XGBoostRegressor.this.type

      def setRateDrop(value: Double): XGBoostRegressor.this.type

      def setRawPredictionCol(value: String): XGBoostRegressor

      def setSampleType(value: String): XGBoostRegressor.this.type

      def setScalePosWeight(value: Double): XGBoostRegressor.this.type

      def setSeed(value: Long): XGBoostRegressor.this.type

      def setSilent(value: Int): XGBoostRegressor.this.type

      def setSinglePrecisionHistogram(value: Boolean): XGBoostRegressor.this.type

      def setSketchEps(value: Double): XGBoostRegressor.this.type

      def setSkipDrop(value: Double): XGBoostRegressor.this.type

      def setSubsample(value: Double): XGBoostRegressor.this.type

      def setThresholds(value: Array[Double]): XGBoostRegressor

      def setTimeoutRequestWorkers(value: Long): XGBoostRegressor.this.type

      def setTrainTestRatio(value: Double): XGBoostRegressor.this.type

      def setTreeMethod(value: String): XGBoostRegressor.this.type

      def setUseExternalMemory(value: Boolean): XGBoostRegressor.this.type

      def setWeightCol(value: String): XGBoostRegressor.this.type

    7. Added algorithm parameters

      grow_policy
        Description: Adds the depthwiselossltd option, which controls how new nodes are added to the tree. Takes effect only when tree_method is set to hist.
        Type: String. Default value: depthwise. Options: depthwise, lossguide, depthwiselossltd.

      min_loss_ratio
        Description: Controls the degree to which tree nodes are pruned during training. Valid only when grow_policy is set to depthwiselossltd.
        Type: Double. Default value: 0. Value range: [0, 1).

      sampling_strategy
        Description: Controls the sampling strategy used during training.
        Type: String. Default value: eachTree. Options: eachTree, eachIteration, allIteration, multiIteration, gossStyle.

      enable_bbgen
        Description: Whether to use the batch Bernoulli bit generation algorithm.
        Type: Boolean. Default value: false. Options: true, false.

      sampling_step
        Description: Controls the number of sampling rounds. Valid only when sampling_strategy is set to multiIteration.
        Type: Int. Default value: 1. Value range: [1, +∞).

      auto_subsample
        Description: Whether to use the policy that automatically reduces the sampling rate.
        Type: Boolean. Default value: false. Options: true, false.

      auto_k
        Description: Controls the number of rounds in the automatic sampling-rate reduction policy. Valid only when auto_subsample is set to true.
        Type: Int. Default value: 1. Value range: [1, +∞).

      auto_subsample_ratio
        Description: Sets the ratios used when the sampling rate is reduced automatically.
        Type: Array[Double]. Default value: Array(0.05, 0.1, 0.2, 0.4, 0.8, 1.0). Value range of each element: (0, 1].

      auto_r
        Description: Controls the error-rate increase allowed when the sampling rate is reduced automatically.
        Type: Double. Default value: 0.95. Value range: (0, 1].

      rabit_enable_tcp_no_delay
        Description: Controls the communication policy of the Rabit engine.
        Type: Boolean. Default value: false. Options: true, false.

      random_split_denom
        Description: Controls the proportion of candidate split points.
        Type: Int. Default value: 1. Value range: [1, +∞).

      default_direction
        Description: Controls the default split direction for missing (default) values.
        Type: String. Default value: learn. Options: left, right, learn.

      Code interface example:
      val xgbRegression = new XGBoostRegressor(param).setLabelCol("label").setFeaturesCol("features")
      val model = xgbRegression.fit(train_data)
      val predictions = model.transform(test_data)
      val evaluator = new RegressionEvaluator().setLabelCol("label").setPredictionCol("prediction").setMetricName("rmse")
      val rmse = evaluator.evaluate(predictions)
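      Spark's RegressionEvaluator accepts the metricName values rmse (the default), mse, r2, and mae; "accuracy" is a classification metric and is not supported here. A sketch evaluating two metrics, assuming a predictions DataFrame with "label" and "prediction" columns:

      ```scala
      import org.apache.spark.ml.evaluation.RegressionEvaluator

      // setMetricName returns the evaluator itself, so it can be reused.
      val evaluator = new RegressionEvaluator()
        .setLabelCol("label")
        .setPredictionCol("prediction")

      val rmse = evaluator.setMetricName("rmse").evaluate(predictions)
      val mae  = evaluator.setMetricName("mae").evaluate(predictions)
      println(s"RMSE = $rmse, MAE = $mae")
      ```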
      • Example usage
        package com.bigdata.ml
        import java.io.File
        import java.lang.System.nanoTime
        import org.apache.spark.sql.{Dataset, Row, SparkSession}
        import org.apache.spark.SparkConf
        import org.apache.spark.ml.evaluation.RegressionEvaluator
        import org.apache.spark.storage.StorageLevel
        import scala.util.Random
        import ml.dmlc.xgboost4j.scala.spark.XGBoostRegressor
        import com.typesafe.config.{Config, ConfigFactory}
        
        object Xgboost_test {
          def profile[R](code: => R, t: Long = nanoTime) = (code, nanoTime - t)
          def getSparkSession(): SparkSession = {
            val conf = new SparkConf()
              .setAppName("XGBOOST-SPARK")
            val spark =
                SparkSession
                  .builder()
                  .config( conf )
                  .getOrCreate()
            spark.sparkContext.setLogLevel("ERROR")
            println("SparkSession created successfully!")
            spark;
          }
        
          def main(args: Array[String]): Unit = {
            val config = ConfigFactory.parseFile(new File(args(0)))
            // set seed
            Random.setSeed(System.currentTimeMillis())
            // set session
            val spark = this.getSparkSession()
            println("created spark session")
            println(spark.sparkContext.getConf.toDebugString)
            val (result, time) = profile(test(spark, config))
            val time_sec = time.asInstanceOf[Long].toDouble * 1.0e-9
            println(s"Profiling complete in $time_sec seconds. ")
          }
        
          def test(spark: SparkSession,
                   config: Config): Unit = {
            var param = Map[String, Any]()
            val it = config.entrySet.iterator
            while (it.hasNext) {
              val entry = it.next
              param += (entry.getKey -> entry.getValue.unwrapped)
            }
            if (!config.hasPath("allow_non_zero_for_missing")) {
              param += ("allow_non_zero_for_missing" -> true)
            }
            println(param.mkString(";\n"))
            val xgbRegression = new XGBoostRegressor(param)
              .setLabelCol("label")
              .setFeaturesCol("features")
            val time_point1 = System.currentTimeMillis()
            val train_data = getTrainData(spark, config).persist(StorageLevel.MEMORY_AND_DISK_SER)
            val time_point2 = System.currentTimeMillis()
            val model = xgbRegression.fit(train_data)
            val time_point3 = System.currentTimeMillis()
            val test_data = getTestData(spark, config).persist(StorageLevel.MEMORY_AND_DISK_SER)
            val predictions = model.transform(test_data)
            val time_point4 = System.currentTimeMillis()
            val load_time = (time_point2 - time_point1) / 1000.0
            println(s"Loading complete in $load_time seconds.")
            val training_time = (time_point3 - time_point2) / 1000.0
            println(s"Training complete in $training_time seconds.")
            val testing_time = (time_point4 - time_point3) / 1000.0
            println(s"Testing complete in $testing_time seconds.")
        
            // Select (prediction, true label) and compute test error.
            // RegressionEvaluator does not support "accuracy"; use rmse.
            val evaluator = new RegressionEvaluator()
              .setLabelCol("label")
              .setPredictionCol("prediction")
              .setMetricName("rmse")
            val rmse = evaluator.evaluate(predictions)
            println(s"Test Error = $rmse")
        
            // Select example rows to display.
            predictions.select("prediction", "label", "features").show(5)
          }
          def getTrainData(spark: SparkSession, config: Config): Dataset[Row] = {
            val tr_fname = config.getString("tr_fname")
            println("tr_fname", tr_fname)
            var reader =  spark
              .read
              .format("libsvm")
              .option("vectorType", if (config.hasPath("vectorType")) config.getString("vectorType") else "sparse")
            if(config.hasPath("numFeatures")) {
              val numFeatures = config.getInt("numFeatures")
              println("numFeatures", numFeatures)
              reader = reader.option("numFeatures", numFeatures)
            }
            val tr_data = reader
              .load(tr_fname)
            tr_data
          }
        
          def getTestData(spark: SparkSession, config: Config): Dataset[Row] = {
            val ts_fname = config.getString("ts_fname")
            println("ts_fname", ts_fname)
            var reader =  spark
              .read
              .format("libsvm")
              .option("vectorType", if (config.hasPath("vectorType")) config.getString("vectorType") else "sparse")
            if(config.hasPath("numFeatures")) {
              val numFeatures = config.getInt("numFeatures")
              println("numFeatures", numFeatures)
              reader = reader.option("numFeatures", numFeatures)
            }
        
            val ts_data = reader
              .load(ts_fname)
            ts_data
          }
        }
      • Example result
        Test Error =  0.5872398843658918
        +--------------------+-----+--------------------+
        |          prediction|label|            features|
        +--------------------+-----+--------------------+
        |  0.2738455533981323|  0.0|(28,[0,1,2,3,4,5,...|
        |0.052151769399642944|  0.0|(28,[0,1,2,3,4,5,...|
        | 0.08468279242515564|  0.0|(28,[0,1,2,3,4,5,...|
        | 0.20581847429275513|  0.0|(28,[0,1,2,3,4,5,...|
        |  0.3741578459739685|  0.0|(28,[0,1,2,3,4,5,...|
        +--------------------+-----+--------------------+
        only showing top 5 rows