Generating Datasets
Generating a CP10M1K Dataset
- Modify the HiBench configuration file.
- Go to the HiBench-HiBench-7.0/conf directory.
cd /HiBench-HiBench-7.0/conf
- Open the /HiBench-HiBench-7.0/conf/workloads/ml/svd.conf configuration file.
vi /HiBench-HiBench-7.0/conf/workloads/ml/svd.conf
- Press i to enter the insert mode and modify the file as follows:
hibench.svd.bigdata.examples 10000000
hibench.svd.bigdata.features 1000
hibench.workload.input ${hibench.hdfs.data.dir}/CP10M1K
- Press Esc, type :wq!, and press Enter to save the file and exit.
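For a sense of scale (my own back-of-the-envelope estimate, not a figure from the HiBench documentation): 10,000,000 examples of 1,000 double-precision features each occupy roughly 75 GiB of raw data, before serialization overhead or HDFS replication.

```scala
// Rough raw-data size for the CP10M1K dataset.
// Assumption: each feature is an 8-byte double; serialization and
// HDFS replication overhead are not counted.
val examples = 10000000L
val features = 1000L
val bytes = examples * features * 8L                 // 80,000,000,000 bytes
val gib = bytes.toDouble / (1024L * 1024L * 1024L)
println(f"approx. raw size: $gib%.1f GiB")           // about 74.5 GiB
```

Make sure the HDFS cluster has enough free capacity before running the generator.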
- Go to the HiBench-HiBench-7.0/conf directory and open the /HiBench-HiBench-7.0/conf/hibench.conf configuration file.
cd /HiBench-HiBench-7.0/conf
vi hibench.conf
- Press i to enter the insert mode and modify the file as follows:

- Press Esc, type :wq!, and press Enter to save the file and exit.
- Generate a dataset.
- Create a directory for storing the generated data in HDFS.
hdfs dfs -mkdir -p /tmp/ml/dataset/
- Go to the path where the execution script resides.
cd /HiBench-HiBench-7.0/bin/workloads/ml/svd/prepare/
- Run the script to generate the CP10M1K dataset.
sh prepare.sh
- View the result.
hadoop fs -ls /HiBench/CP10M1K
- If a permission error is reported during generation, change the permissions on the corresponding directories as the root user and in HDFS, respectively.
- Create a directory for storing the generated data in HDFS.
hadoop fs -mkdir -p /tmp/ml/dataset
- Start spark-shell.
spark-shell
- Run the following command:
:paste
- Execute the following code to process the dataset:
import org.apache.spark.internal.Logging
import org.apache.spark.ml.linalg.SQLDataTypes.VectorType
import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.mllib.linalg.DenseVector
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StructField, StructType}
import org.apache.spark.storage.StorageLevel

val dataPath = "/HiBench/CP10M1K"
val outputPath = "/tmp/ml/dataset/CP10M1K"

spark
  .sparkContext
  .objectFile[DenseVector](dataPath)
  .map(row => Vectors.dense(row.values).toArray.map{u => f"$u%.2f"}.mkString(","))
  .saveAsTextFile(outputPath)
- Check the HDFS directory to view the result.
hadoop fs -ls /tmp/ml/dataset/CP10M1K
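The map step in the spark-shell snippet above turns each DenseVector row into one comma-separated text line, with every value rounded to two decimal places. A local sketch of that per-row transformation (plain Scala, no Spark, using a hypothetical three-value input row):

```scala
// Mimics the per-row conversion from the spark-shell snippet:
// format each double with two decimals and join with commas.
// The input values here are made up for illustration.
val row = Array(0.123456, 4.2, -1.0)
val line = row.map(u => f"$u%.2f").mkString(",")
println(line)   // 0.12,4.20,-1.00
```

Each line of the output files under /tmp/ml/dataset/CP10M1K has this shape, with 1,000 values per line.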

Generating a CP2M5K Dataset
- Modify the HiBench configuration file.
- Go to the HiBench-HiBench-7.0/conf directory.
cd /HiBench-HiBench-7.0/conf
- Open the /HiBench-HiBench-7.0/conf/workloads/ml/svd.conf configuration file.
vi /HiBench-HiBench-7.0/conf/workloads/ml/svd.conf
- Press i to enter the insert mode and modify the file as follows:
hibench.svd.bigdata.examples 2000000
hibench.svd.bigdata.features 5000
hibench.workload.input ${hibench.hdfs.data.dir}/CP2M5K
- Press Esc, type :wq!, and press Enter to save the file and exit.
- Go to the HiBench-HiBench-7.0/conf directory.
cd /HiBench-HiBench-7.0/conf
- Open the /HiBench-HiBench-7.0/conf/hibench.conf configuration file.
vi hibench.conf
- Press i to enter the insert mode and modify the file as follows:
hibench.scale.profile bigdata
hibench.default.map.parallelism 500
hibench.default.shuffle.parallelism 600

- Press Esc, type :wq!, and press Enter to save the file and exit.
- Generate data.
- Create a directory for storing the generated data in HDFS.
hdfs dfs -mkdir -p /tmp/ml/dataset/
- Go to the path where the execution script resides.
cd /HiBench-HiBench-7.0/bin/workloads/ml/svd/prepare/
- Run the script to generate the CP2M5K dataset.
sh prepare.sh
- If a permission error is reported during generation, change the permissions on the corresponding directories as the root user and in HDFS, respectively.
- Create a directory for storing the generated data in HDFS.
hadoop fs -mkdir -p /tmp/ml/dataset
- Start spark-shell.
spark-shell
- Run the following command:
:paste
- Execute the following code to process the dataset:
import org.apache.spark.internal.Logging
import org.apache.spark.ml.linalg.SQLDataTypes.VectorType
import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.mllib.linalg.DenseVector
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StructField, StructType}
import org.apache.spark.storage.StorageLevel

val dataPath = "/HiBench/CP2M5K"
val outputPath = "/tmp/ml/dataset/CP2M5K"

spark
  .sparkContext
  .objectFile[DenseVector](dataPath)
  .map(row => Vectors.dense(row.values).toArray.map{u => f"$u%.2f"}.mkString(","))
  .saveAsTextFile(outputPath)
- Check the HDFS directory to view the result.
hadoop fs -ls /tmp/ml/dataset/CP2M5K

Generating an ALS Dataset
- Set the HiBench configuration file.
- Go to the HiBench-HiBench-7.0/conf directory.
cd /HiBench-HiBench-7.0/conf
- Open the /HiBench-HiBench-7.0/conf/workloads/ml/als.conf configuration file.
vi /HiBench-HiBench-7.0/conf/workloads/ml/als.conf
- Press i to enter the insert mode and modify the file as follows:

- Press Esc, type :wq!, and press Enter to save the file and exit.
- Go to the HiBench-HiBench-7.0/conf directory.
cd /HiBench-HiBench-7.0/conf
- Open the /HiBench-HiBench-7.0/conf/hibench.conf configuration file.
vi hibench.conf
- Press i to enter the insert mode and modify the file as follows:

- Press Esc, type :wq!, and press Enter to save the file and exit.
- Generate data.
- Create a directory for storing the generated data in HDFS.
hdfs dfs -mkdir -p /tmp/ml/dataset/ALS
- Go to the path where the execution script resides.
cd /HiBench-HiBench-7.0/bin/workloads/ml/als/prepare/
- Run the script to generate the ALS dataset.
sh prepare.sh
- View the result.
hadoop fs -ls /tmp/ml/dataset/ALS

If a permission error is reported during generation, change the permissions on the corresponding directories as the root user and in HDFS, respectively.
Generating a D200M100 Dataset
- Set the HiBench configuration file.
- Go to the HiBench-HiBench-7.0/conf directory.
cd /HiBench-HiBench-7.0/conf
- Open the /HiBench-HiBench-7.0/conf/workloads/ml/kmeans.conf configuration file.
vi /HiBench-HiBench-7.0/conf/workloads/ml/kmeans.conf
- Press i to enter the insert mode and modify the file as follows:
hibench.kmeans.gigantic.num_of_clusters 5
hibench.kmeans.gigantic.dimensions 100
hibench.kmeans.gigantic.num_of_samples 200000000
hibench.kmeans.gigantic.samples_per_inputfile 40000000
hibench.kmeans.gigantic.max_iteration 5
hibench.kmeans.gigantic.k 10
hibench.kmeans.gigantic.convergedist 0.5
hibench.workload.input hdfs://server1:8020/tmp/ml/dataset/kmeans_200m100_tmp

- Press Esc, type :wq!, and press Enter to save the file and exit.
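A quick consistency check on the values above (my own arithmetic, not from the HiBench documentation): with 200,000,000 samples and 40,000,000 samples per input file, the generator writes the data as 5 input files of 100-dimensional samples.

```scala
// Sanity arithmetic for the kmeans.conf values above.
val numSamples = 200000000L
val samplesPerFile = 40000000L
val dimensions = 100
val inputFiles = numSamples / samplesPerFile   // 5 input files
println(s"$inputFiles input files of $samplesPerFile samples x $dimensions dims")
```

If you change num_of_samples, keep samples_per_inputfile a divisor of it so no partial file is needed.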
- Go to the HiBench-HiBench-7.0/conf directory.
cd /HiBench-HiBench-7.0/conf
- Open the /HiBench-HiBench-7.0/conf/hibench.conf configuration file.
vi hibench.conf
- Press i to enter the insert mode and modify the file as follows:

- Press Esc, type :wq!, and press Enter to save the file and exit.
- Generate data.
- Create a directory for storing the generated data in HDFS.
hdfs dfs -mkdir -p /tmp/ml/dataset/kmeans_200m100_tmp
- Go to the path where the execution script resides.
cd /HiBench-HiBench-7.0/bin/workloads/ml/kmeans/prepare/
- Run the script to generate the D200M100 dataset.
sh prepare.sh
- View the result.
hdfs dfs -ls /tmp/ml/dataset/kmeans_200m100_tmp
If a permission error is reported during generation, change the permissions on the corresponding directories as the root user and in HDFS, respectively.
- Create a directory for storing the generated dataset in HDFS.
hdfs dfs -mkdir -p /tmp/ml/dataset/kmeans_200m100
- Move the dataset to a specified path.
hdfs dfs -mv /tmp/ml/dataset/kmeans_200m100_tmp/samples/* /tmp/ml/dataset/kmeans_200m100/
- View the result.
hdfs dfs -ls /tmp/ml/dataset/kmeans_200m100/

- Delete redundant directories.
hdfs dfs -rm -r /tmp/ml/dataset/kmeans_200m100_tmp
Generating a D10M4096 Dataset
- Set the HiBench configuration file.
- Go to the HiBench-HiBench-7.0/conf directory.
cd /HiBench-HiBench-7.0/conf
- Open the /HiBench-HiBench-7.0/conf/workloads/ml/lr.conf configuration file.
vi /HiBench-HiBench-7.0/conf/workloads/ml/lr.conf
- Press i to enter the insert mode and modify the file as follows. Change the number of data samples to 10000000 and the number of data features to 4096 to generate a dataset with 10 million samples:
hibench.lr.bigdata.examples 10000000
hibench.lr.bigdata.features 4096

- Press Esc, type :wq!, and press Enter to save the file and exit.
- Go to the HiBench-HiBench-7.0/conf directory.
cd /HiBench-HiBench-7.0/conf
- Open the /HiBench-HiBench-7.0/conf/hibench.conf configuration file.
vi hibench.conf
- Press i to enter the insert mode and modify the file as follows:
hibench.scale.profile bigdata
hibench.default.map.parallelism 300
hibench.default.shuffle.parallelism 300

- Press Esc, type :wq!, and press Enter to save the file and exit.
- Generate data.
- Create a directory for storing the generated data in HDFS.
hdfs dfs -mkdir -p /tmp/ml/dataset/
- Go to the path where the execution script resides.
cd /HiBench-HiBench-7.0/bin/workloads/ml/lr/prepare/
- Run the script to generate the D10M4096 dataset.
sh prepare.sh
- View the result.
hdfs dfs -ls /HiBench/HiBench/LR/Input
If a permission error is reported during generation, change the permissions on the corresponding directories as the root user and in HDFS, respectively.
- Start spark-shell.
spark-shell
- Run the following command:
:paste
- Execute the following code to process the dataset:
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.regression.LabeledPoint

val data: RDD[LabeledPoint] = sc.objectFile("/HiBench/HiBench/LR/Input/10m4096")
val i = data.map{t => t.label.toString + "," + t.features.toArray.mkString(" ")}
val splits = i.randomSplit(Array(0.6, 0.4), seed = 11L)
splits(0).saveAsTextFile("/HiBench/HiBench/LR/Output/10m4096_train")
splits(1).saveAsTextFile("/HiBench/HiBench/LR/Output/10m4096_test")
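Note that randomSplit divides the records probabilistically: each record independently lands in the train split with probability 0.6, so the 60/40 ratio is approximate rather than exact. A local sketch of the same idea with a seeded plain-Scala RNG (illustrative only, not Spark's actual implementation):

```scala
import scala.util.Random

// Approximate 60/40 split, mirroring the idea behind RDD.randomSplit:
// each element is assigned independently, so the counts only roughly
// match the requested weights. Seed and sizes here are illustrative.
val rng = new Random(11L)
val data = (1 to 10000).toVector
val (train, test) = data.partition(_ => rng.nextDouble() < 0.6)
println(s"train=${train.size} test=${test.size}")
```

If your workload requires an exact split size, take a count first and use a deterministic slicing scheme instead.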
Generating a HiBench_10M_200M Dataset
- Set the HiBench configuration file.
- Go to the HiBench-HiBench-7.0/conf directory.
cd /HiBench-HiBench-7.0/conf
- Open the /HiBench-HiBench-7.0/conf/workloads/ml/lda.conf configuration file.
vi /HiBench-HiBench-7.0/conf/workloads/ml/lda.conf
- Press i to enter the insert mode and modify the file as follows:
hibench.lda.bigdata.num_of_documents 10000000
hibench.lda.bigdata.num_of_vocabulary 200007152
hibench.lda.bigdata.num_of_topics 100
hibench.lda.bigdata.doc_len_min 500
hibench.lda.bigdata.doc_len_max 10000
hibench.lda.bigdata.maxresultsize "6g"
hibench.lda.num_of_documents ${hibench.lda.${hibench.scale.profile}.num_of_documents}
hibench.lda.num_of_vocabulary ${hibench.lda.${hibench.scale.profile}.num_of_vocabulary}
hibench.lda.num_of_topics ${hibench.lda.${hibench.scale.profile}.num_of_topics}
hibench.lda.doc_len_min ${hibench.lda.${hibench.scale.profile}.doc_len_min}
hibench.lda.doc_len_max ${hibench.lda.${hibench.scale.profile}.doc_len_max}
hibench.lda.maxresultsize ${hibench.lda.${hibench.scale.profile}.maxresultsize}
hibench.lda.partitions ${hibench.default.map.parallelism}
hibench.lda.optimizer "online"
hibench.lda.num_iterations 10
- Press Esc, type :wq!, and press Enter to save the file and exit.
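The hibench.lda.* entries above use nested placeholders: a value such as ${hibench.lda.${hibench.scale.profile}.num_of_topics} is resolved innermost-first, substituting the profile name (here, bigdata) and then looking up the resulting key. A minimal toy resolver sketching that behavior (illustrative only, not HiBench's actual configuration code):

```scala
// Toy inner-first placeholder resolution, mimicking how
// ${hibench.lda.${hibench.scale.profile}.num_of_topics} resolves.
// The property map below is a two-entry stand-in for hibench.conf/lda.conf.
val props = Map(
  "hibench.scale.profile" -> "bigdata",
  "hibench.lda.bigdata.num_of_topics" -> "100"
)

def resolve(value: String): String = {
  // Match only innermost ${...} spans (no nested braces inside).
  val pattern = """\$\{([^${}]+)\}""".r
  val next = pattern.replaceAllIn(value, m => props(m.group(1)))
  if (next == value) value else resolve(next)
}

println(resolve("${hibench.lda.${hibench.scale.profile}.num_of_topics}"))   // 100
```

This is why changing hibench.scale.profile in hibench.conf switches the whole hibench.lda.* block to a different set of sizes.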
- Generate data.
- Create a directory for storing the generated data in HDFS.
hdfs dfs -mkdir -p /tmp/ml/dataset/
- Go to the path where the execution script resides.
cd /HiBench-HiBench-7.0/bin/workloads/ml/lda/prepare/
- Run bin/workloads/ml/lda/prepare/prepare.sh to generate a dataset.
sh prepare.sh
- Start spark-shell.
spark-shell
- Run the following command:
:paste
- Run the following code to convert the generated data into the ORC format:
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.{Vector => OldVector, Vectors => OldVectors}
import org.apache.spark.ml.linalg.{Vector, Vectors}

case class DocSchema(id: Long, tf: Vector)

// Set dataPath to the HDFS path of the generated LDA data and
// outputPath to the target path for the ORC output before running.
val data: RDD[(Long, OldVector)] = sc.objectFile(dataPath)
val df = spark.createDataFrame(data.map {doc => DocSchema(doc._1, doc._2.asML)})
df.repartition(200).write.mode("overwrite").format("orc").save(outputPath)
Generating a HibenchRating3wx3w Dataset
- Modify parameters in the scala file.
- Open the Hibench/sparkbench/ml/src/main/scala/com/intel/sparkbench/ml/RatingDataGenerator.scala file.
vi Hibench/sparkbench/ml/src/main/scala/com/intel/sparkbench/ml/RatingDataGenerator.scala
- Press i to enter the insert mode and modify the numPartitions parameter: comment out lines 36 and 37 and add the following as line 38.
val numPartitions = parallel

- Press Esc, type :wq!, and press Enter to save the file and exit.
- Generate data.
- Compile the sparkbench module.
mvn package
- Save the compiled sparkbench-common-8.0-SNAPSHOT.jar and sparkbench-ml-8.0-SNAPSHOT.jar files in the same folder and call RatingDataGenerator to generate data.
spark-submit \
  --class com.intel.hibench.sparkbench.ml.RatingDataGenerator \
  --jars sparkbench-common-8.0-SNAPSHOT.jar \
  --conf "spark.executor.instances=71" \
  --conf "spark.executor.cores=4" \
  --conf "spark.executor.memory=12g" \
  --conf "spark.executor.memoryOverhead=2g" \
  --conf "spark.default.parallelism=284" \
  --master yarn \
  --deploy-mode client \
  --driver-cores 36 \
  --driver-memory 50g \
  ./sparkbench-ml-8.0-SNAPSHOT.jar \
  /tmp/hibench/HibenchRating3wx3w 24000 6000 900000 false
Parameters:
- /tmp/hibench/HibenchRating3wx3w: location where the generated data is stored.
- 24000: number of users.
- 6000: number of products.
- 900000: number of ratings.
- false: Implicit feedback data is not generated.
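Two sanity checks on these parameters (my own arithmetic, not from the HiBench documentation): 900,000 ratings over a 24,000 x 6,000 user-product matrix fill only about 0.625% of the cells, and spark.default.parallelism=284 matches 71 executors with 4 cores each.

```scala
// Sanity arithmetic for the spark-submit parameters above.
val users = 24000L
val products = 6000L
val ratings = 900000L
val density = ratings.toDouble / (users * products)   // fraction of cells filled
println(f"matrix density: ${density * 100}%.3f%%")    // 0.625%

val executors = 71
val coresPerExecutor = 4
val parallelism = executors * coresPerExecutor        // 284 total cores
println(s"default parallelism: $parallelism")
```

Keeping spark.default.parallelism equal to the total executor core count is a common starting point; tune from there for your cluster.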
Generating an Avazu Dataset
Obtain the Avazu dataset from its download page.
Select the dataset whose file name contains -site. Number of rows in the training set/number of rows in the test set/number of features: 25,832,830/2,858,160/1,000,000.

Parent topic: Generating a Dataset Using HiBench