
Generating Datasets

Generating the CP10M1K Dataset

  1. Set the HiBench configuration file.
    1. Go to the HiBench-HiBench-7.0/conf directory.
      cd /HiBench-HiBench-7.0/conf
    2. Open the /HiBench-HiBench-7.0/conf/workloads/ml/svd.conf configuration file.
      vim /HiBench-HiBench-7.0/conf/workloads/ml/svd.conf

      Modify the configuration as follows:

      hibench.svd.bigdata.examples            10000000
      hibench.svd.bigdata.features            1000
      hibench.workload.input                  ${hibench.hdfs.data.dir}/CP10M1K
    3. Go to the HiBench-HiBench-7.0/conf directory and modify the /HiBench-HiBench-7.0/conf/hibench.conf configuration file.
      cd /HiBench-HiBench-7.0/conf
      vim hibench.conf
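
      The required hibench.conf settings likely mirror those in the CP2M5K section later in this document; the values below are assumptions, not sourced from this page:

      hibench.scale.profile                 bigdata
      hibench.default.map.parallelism       500
      hibench.default.shuffle.parallelism   600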

  2. Generate a dataset.
    1. Create a directory for storing the generated data in HDFS.
      hdfs dfs -mkdir -p /tmp/ml/dataset/
    2. Go to the path where the execution script resides.
      cd /HiBench-HiBench-7.0/bin/workloads/ml/svd/prepare/
    3. Run the script to generate the CP10M1K dataset.
      sh prepare.sh
    4. View the result.
      hadoop fs -ls /HiBench/CP10M1K

      If a permission error is reported during generation, change the permissions on the corresponding local and HDFS directories (as the root user and via HDFS commands, respectively).
  3. Create a folder in HDFS.
    hadoop fs -mkdir -p /tmp/ml/dataset
  4. Start spark-shell.
    spark-shell
  5. Run the following command (do not omit the colon):
    :paste
  6. Execute the following code to process the dataset:
    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.mllib.linalg.DenseVector

    val dataPath = "/HiBench/CP10M1K"
    val outputPath = "/tmp/ml/dataset/CP10M1K"

    // Read the object file of DenseVectors and write each vector out as a
    // comma-separated text line with values formatted to two decimal places.
    spark
      .sparkContext
      .objectFile[DenseVector](dataPath)
      .map(row => Vectors.dense(row.values).toArray.map(u => f"$u%.2f").mkString(","))
      .saveAsTextFile(outputPath)
  7. Check the HDFS directory to view the result.
    hadoop fs -ls /tmp/ml/dataset/CP10M1K
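The map step in the code above reduces to a simple formatting rule: print each value with two decimal places and join the values with commas. The same rule, illustrated with printf on made-up sample values:

```shell
# Mirror the Scala f"$u%.2f" ... mkString(",") formatting on sample values.
printf '%.2f,%.2f,%.2f\n' 1 2.25 0.5
```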

Generating the CP2M5K Dataset

  1. Set the HiBench configuration file.
    1. Go to the HiBench-HiBench-7.0/conf directory.
      cd /HiBench-HiBench-7.0/conf
    2. Modify the /HiBench-HiBench-7.0/conf/workloads/ml/svd.conf configuration file.
      vim /HiBench-HiBench-7.0/conf/workloads/ml/svd.conf

      Modify the configuration as follows:

      hibench.svd.bigdata.examples            2000000
      hibench.svd.bigdata.features            5000
      hibench.workload.input                  ${hibench.hdfs.data.dir}/CP2M5K
    3. Go to the HiBench-HiBench-7.0/conf directory and modify the /HiBench-HiBench-7.0/conf/hibench.conf configuration file.
      cd /HiBench-HiBench-7.0/conf
      vim hibench.conf

      Modify the configuration as follows:

      hibench.scale.profile                 bigdata
      hibench.default.map.parallelism       500
      hibench.default.shuffle.parallelism   600

  2. Generate data.
    1. Create a directory for storing the generated data in HDFS.
      hdfs dfs -mkdir -p /tmp/ml/dataset/
    2. Go to the path where the execution script resides.
      cd /HiBench-HiBench-7.0/bin/workloads/ml/svd/prepare/
    3. Run the script to generate the CP2M5K dataset.
      sh prepare.sh

      If a permission error is reported during generation, change the permissions on the corresponding local and HDFS directories (as the root user and via HDFS commands, respectively).
  3. Create a folder in HDFS.
    hadoop fs -mkdir -p /tmp/ml/dataset
  4. Start spark-shell.
    spark-shell
  5. Run the following command (do not omit the colon):
    :paste
  6. Execute the following code to process the dataset:
    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.mllib.linalg.DenseVector

    val dataPath = "/HiBench/CP2M5K"
    val outputPath = "/tmp/ml/dataset/CP2M5K"

    // Read the object file of DenseVectors and write each vector out as a
    // comma-separated text line with values formatted to two decimal places.
    spark
      .sparkContext
      .objectFile[DenseVector](dataPath)
      .map(row => Vectors.dense(row.values).toArray.map(u => f"$u%.2f").mkString(","))
      .saveAsTextFile(outputPath)
  7. Check the HDFS directory to view the result.
    hadoop fs -ls /tmp/ml/dataset/CP2M5K
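Note that the two SVD datasets have the same total size and differ only in shape: 10 million examples x 1,000 features and 2 million examples x 5,000 features both contain 10 billion values. A quick arithmetic check:

```shell
# Total matrix entries in each SVD dataset: both come to 10 billion.
echo $((10000000 * 1000))   # CP10M1K
echo $((2000000 * 5000))    # CP2M5K
```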

Generating the ALS Dataset

  1. Set the HiBench configuration file.
    1. Go to the HiBench-HiBench-7.0/conf directory.
      cd /HiBench-HiBench-7.0/conf
    2. Modify the /HiBench-HiBench-7.0/conf/workloads/ml/als.conf configuration file.
      vim /HiBench-HiBench-7.0/conf/workloads/ml/als.conf

    3. Go to the HiBench-HiBench-7.0/conf directory and modify the /HiBench-HiBench-7.0/conf/hibench.conf configuration file.
      cd /HiBench-HiBench-7.0/conf
      vim hibench.conf

  2. Generate data.
    1. Create a directory for storing the generated data in HDFS.
      hdfs dfs -mkdir -p /tmp/ml/dataset/ALS
    2. Go to the path where the execution script resides.
      cd /HiBench-HiBench-7.0/bin/workloads/ml/als/prepare/
    3. Run the script to generate the ALS dataset.
      sh prepare.sh
    4. View the result.
      hadoop fs -ls /tmp/ml/dataset/ALS

      If a permission error is reported during generation, change the permissions on the corresponding local and HDFS directories (as the root user and via HDFS commands, respectively).

Generating the D200M100 Dataset

  1. Set the HiBench configuration file.
    1. Go to the HiBench-HiBench-7.0/conf directory.
      cd /HiBench-HiBench-7.0/conf
    2. Modify the /HiBench-HiBench-7.0/conf/workloads/ml/kmeans.conf configuration file.
      vim /HiBench-HiBench-7.0/conf/workloads/ml/kmeans.conf

      Modify the configuration as follows:

      hibench.kmeans.gigantic.num_of_clusters          5
      hibench.kmeans.gigantic.dimensions               100
      hibench.kmeans.gigantic.num_of_samples           200000000
      hibench.kmeans.gigantic.samples_per_inputfile    40000000
      hibench.kmeans.gigantic.max_iteration            5
      hibench.kmeans.gigantic.k                        10
      hibench.kmeans.gigantic.convergedist             0.5

      hibench.workload.input                           hdfs://server1:8020/tmp/ml/dataset/kmeans_200m100_tmp

    3. Go to the HiBench-HiBench-7.0/conf directory and modify the /HiBench-HiBench-7.0/conf/hibench.conf configuration file.
      cd /HiBench-HiBench-7.0/conf
      vim hibench.conf

  2. Generate data.
    1. Create a directory for storing the generated data in HDFS.
      hdfs dfs -mkdir -p /tmp/ml/dataset/kmeans_200m100_tmp
    2. Go to the path where the execution script resides.
      cd /HiBench-HiBench-7.0/bin/workloads/ml/kmeans/prepare/
    3. Run the script to generate the D200M100 dataset.
      sh prepare.sh
  3. View the result.
    hdfs dfs -ls /tmp/ml/dataset/kmeans_200m100_tmp

    If a permission error is reported during generation, change the permissions on the corresponding local and HDFS directories (as the root user and via HDFS commands, respectively).

  4. Create a directory for storing the generated dataset in HDFS.
    hdfs dfs -mkdir -p /tmp/ml/dataset/kmeans_200m100
  5. Move the dataset to a specified path.
    hdfs dfs -mv /tmp/ml/dataset/kmeans_200m100_tmp/samples/* /tmp/ml/dataset/kmeans_200m100/
  6. View the result.
    hdfs dfs -ls /tmp/ml/dataset/kmeans_200m100/

  7. Delete redundant directories.
    hdfs dfs -rm -r /tmp/ml/dataset/kmeans_200m100_tmp
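
As a sanity check on the kmeans settings above, num_of_samples divided by samples_per_inputfile gives the number of input files the generator should produce:

```shell
# 200,000,000 samples at 40,000,000 samples per input file.
echo $((200000000 / 40000000))   # expected number of generated input files
```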

Generating the D10M4096 Dataset

  1. Set the HiBench configuration file.
    1. Go to the HiBench-HiBench-7.0/conf directory.
      cd /HiBench-HiBench-7.0/conf
    2. Modify the /HiBench-HiBench-7.0/conf/workloads/ml/lr.conf configuration file.
      vim /HiBench-HiBench-7.0/conf/workloads/ml/lr.conf

      Modify the configuration as follows. (Change the number of data samples to 10 million and the number of data features to 4096 to generate a large-scale dataset of about 41 billion feature values in total.)

      hibench.lr.bigdata.examples  10000000
      hibench.lr.bigdata.features  4096

    3. Go to the HiBench-HiBench-7.0/conf directory and modify the /HiBench-HiBench-7.0/conf/hibench.conf configuration file.
      cd /HiBench-HiBench-7.0/conf
      vim hibench.conf

      Modify the configuration as follows:

      hibench.scale.profile                 bigdata
      hibench.default.map.parallelism       300
      hibench.default.shuffle.parallelism   300

  2. Generate data.
    1. Create a directory for storing the generated data in HDFS.
      hdfs dfs -mkdir -p /tmp/ml/dataset/
    2. Go to the path where the execution script resides.
      cd /HiBench-HiBench-7.0/bin/workloads/ml/lr/prepare/
    3. Run the script to generate the D10M4096 dataset.
      sh prepare.sh
  3. View the result.
    hdfs dfs -ls /HiBench/HiBench/LR/Input

    If a permission error is reported during generation, change the permissions on the corresponding local and HDFS directories (as the root user and via HDFS commands, respectively).

  4. Start spark-shell.
    spark-shell
  5. Run the following command (do not omit the colon):
    :paste
  6. Execute the following code to process the dataset:
    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.regression.LabeledPoint
    val data: RDD[LabeledPoint] = sc.objectFile("/HiBench/HiBench/LR/Input/10m4096")
    val i = data.map{t=>t.label.toString+","+t.features.toArray.mkString(" ")}
    val splits = i.randomSplit(Array(0.6, 0.4), seed = 11L)
    splits(0).saveAsTextFile("/HiBench/HiBench/LR/Output/10m4096_train")
    splits(1).saveAsTextFile("/HiBench/HiBench/LR/Output/10m4096_test")
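
The lr.conf settings above imply a sizeable dataset: 10 million examples x 4096 features is about 41 billion values, or roughly 305 GiB if stored as 8-byte doubles (the 8-byte encoding is an assumption, not stated in this document):

```shell
# 10,000,000 examples x 4096 features x 8 bytes per double.
echo $((10000000 * 4096 * 8))   # bytes of raw feature data (~305 GiB)
```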

Generating the HiBench_10M_200M Dataset

  1. Set the HiBench configuration file.
    1. Go to the HiBench-HiBench-7.0/conf directory.
      cd /HiBench-HiBench-7.0/conf
    2. Modify the /HiBench-HiBench-7.0/conf/workloads/ml/lda.conf file as follows:
      hibench.lda.bigdata.num_of_documents              10000000
      hibench.lda.bigdata.num_of_vocabulary             200007152
      hibench.lda.bigdata.num_of_topics                 100
      hibench.lda.bigdata.doc_len_min                   500
      hibench.lda.bigdata.doc_len_max                   10000
      hibench.lda.bigdata.maxresultsize                 "6g"
      hibench.lda.num_of_documents                      ${hibench.lda.${hibench.scale.profile}.num_of_documents}
      hibench.lda.num_of_vocabulary                     ${hibench.lda.${hibench.scale.profile}.num_of_vocabulary}
      hibench.lda.num_of_topics                         ${hibench.lda.${hibench.scale.profile}.num_of_topics}
      hibench.lda.doc_len_min                           ${hibench.lda.${hibench.scale.profile}.doc_len_min}
      hibench.lda.doc_len_max                           ${hibench.lda.${hibench.scale.profile}.doc_len_max}
      hibench.lda.maxresultsize                         ${hibench.lda.${hibench.scale.profile}.maxresultsize}
      hibench.lda.partitions                            ${hibench.default.map.parallelism}
      hibench.lda.optimizer                             "online"
      hibench.lda.num_iterations                        10
  2. Generate data.
    1. Create a directory for storing the generated data in HDFS.
      hdfs dfs -mkdir -p /tmp/ml/dataset/
    2. Go to the path where the execution script resides.
      cd /HiBench-HiBench-7.0/bin/workloads/ml/lda/prepare/
    3. Run the script to generate the HiBench_10M_200M dataset.
      sh prepare.sh
  3. Start spark-shell.
    spark-shell
  4. Run the following command (do not omit the colon):
    :paste
  5. Run the following code to convert the generated data into the ORC format. (Define dataPath as the HDFS path of the generated LDA data and outputPath as the target ORC directory before running.)
    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.linalg.{Vector => OldVector, Vectors => OldVectors}
    import org.apache.spark.ml.linalg.{Vector, Vectors}

    case class DocSchema(id: Long, tf: Vector)

    // Load the (id, term-frequency vector) pairs, convert the old mllib
    // vectors to ml vectors, and write the result as ORC in 200 partitions.
    val data: RDD[(Long, OldVector)] = sc.objectFile(dataPath)
    val df = spark.createDataFrame(data.map { doc => DocSchema(doc._1, doc._2.asML) })
    df.repartition(200).write.mode("overwrite").format("orc").save(outputPath)
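
The hibench.lda.* entries in step 1 use nested substitution: with hibench.scale.profile set to bigdata, a value such as ${hibench.lda.${hibench.scale.profile}.num_of_documents} resolves to hibench.lda.bigdata.num_of_documents. The same two-step lookup can be sketched with shell variables (the names here are illustrative only, not part of HiBench):

```shell
# Emulate HiBench's nested property lookup: the profile name is substituted
# into the key first, then the resulting key is resolved to its value.
profile="bigdata"
lda_bigdata_num_of_documents=10000000
key="lda_${profile}_num_of_documents"   # inner substitution
eval "echo \$$key"                      # outer lookup
```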

Generating the HibenchRating3wx3w Dataset

  1. Modify the numPartitions parameter in the Hibench/sparkbench/ml/src/main/scala/com/intel/sparkbench/ml/RatingDataGenerator.scala file: comment out lines 36 and 37, and add the following as line 38.
    val numPartitions = parallel

  2. Generate data.
    1. Compile the sparkbench module.
      mvn package
    2. Save the compiled sparkbench-common-8.0-SNAPSHOT.jar and sparkbench-ml-8.0-SNAPSHOT.jar files in the same folder and call RatingDataGenerator to generate data.
      spark-submit \
      --class com.intel.hibench.sparkbench.ml.RatingDataGenerator \
      --jars sparkbench-common-8.0-SNAPSHOT.jar \
      --conf "spark.executor.instances=71" \
      --conf "spark.executor.cores=4" \
      --conf "spark.executor.memory=12g" \
      --conf "spark.executor.memoryOverhead=2g" \
      --conf "spark.default.parallelism=284" \
      --master yarn \
      --deploy-mode client \
      --driver-cores 36 \
      --driver-memory 50g \
      ./sparkbench-ml-8.0-SNAPSHOT.jar \
      /tmp/hibench/HibenchRating3wx3w 24000 6000 900000 false

      Parameters:

      • /tmp/hibench/HibenchRating3wx3w: location where the generated data is stored.
      • 24000: number of users.
      • 6000: number of products.
      • 900000: number of ratings.
      • false: Implicit feedback data is not generated.
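
As a quick check of the generator arguments above: 900,000 ratings over a 24,000-user x 6,000-product matrix fill only a small fraction of the rating matrix:

```shell
# Rating-matrix density: ratings / (users * products).
awk 'BEGIN { printf "%.5f\n", 900000 / (24000 * 6000) }'
```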