
Generating Datasets

Generating the CP10M1K Dataset

  1. Set the HiBench configuration file.
    1. Go to the HiBench-HiBench-7.0/conf directory.
      cd /HiBench-HiBench-7.0/conf
    2. Open the /HiBench-HiBench-7.0/conf/workloads/ml/svd.conf configuration file.
      vim /HiBench-HiBench-7.0/conf/workloads/ml/svd.conf

      Modify the configuration as follows:

      hibench.svd.bigdata.examples            10000000
      hibench.svd.bigdata.features            1000
      hibench.workload.input                  ${hibench.hdfs.data.dir}/CP10M1K
    3. Go to the HiBench-HiBench-7.0/conf directory and modify the /HiBench-HiBench-7.0/conf/hibench.conf configuration file.
      cd /HiBench-HiBench-7.0/conf
      vim hibench.conf
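
      The required hibench.conf settings likely mirror those in the CP2M5K section later in this document; the values below are assumptions, not sourced from this page:

      hibench.scale.profile                 bigdata
      hibench.default.map.parallelism       500
      hibench.default.shuffle.parallelism   600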

  2. Generate a dataset.
    1. Create a directory for storing the generated data in HDFS.
      hdfs dfs -mkdir -p /tmp/ml/dataset/
    2. Go to the path where the execution script resides.
      cd /HiBench-HiBench-7.0/bin/workloads/ml/svd/prepare/
    3. Run the script to generate the CP10M1K dataset.
      sh prepare.sh
    4. View the result.
      hadoop fs -ls /HiBench/CP10M1K

      If a permission error is reported during generation, change the permissions on the corresponding local and HDFS directories (as the root user and via HDFS commands, respectively).
  3. Create a folder in HDFS.
    hadoop fs -mkdir -p /tmp/ml/dataset
  4. Start spark-shell.
    spark-shell
  5. Run the following command (do not omit the colon):
    :paste
  6. Execute the following code to process the dataset:
    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.mllib.linalg.DenseVector

    val dataPath = "/HiBench/CP10M1K"
    val outputPath = "/tmp/ml/dataset/CP10M1K"

    // Read the object file of DenseVectors and write each vector out as a
    // comma-separated text line with values formatted to two decimal places.
    spark
      .sparkContext
      .objectFile[DenseVector](dataPath)
      .map(row => Vectors.dense(row.values).toArray.map(u => f"$u%.2f").mkString(","))
      .saveAsTextFile(outputPath)
  7. Check the HDFS directory to view the result.
    hadoop fs -ls /tmp/ml/dataset/CP10M1K
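The map step in the code above reduces to a simple formatting rule: print each value with two decimal places and join the values with commas. The same rule, illustrated with printf on made-up sample values:

```shell
# Mirror the Scala f"$u%.2f" ... mkString(",") formatting on sample values.
printf '%.2f,%.2f,%.2f\n' 1 2.25 0.5
```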

Generating the CP2M5K Dataset

  1. Set the HiBench configuration file.
    1. Go to the HiBench-HiBench-7.0/conf directory.
      cd /HiBench-HiBench-7.0/conf
    2. Modify the /HiBench-HiBench-7.0/conf/workloads/ml/svd.conf configuration file.
      vim /HiBench-HiBench-7.0/conf/workloads/ml/svd.conf

      Modify the configuration as follows:

      hibench.svd.bigdata.examples            2000000
      hibench.svd.bigdata.features            5000
      hibench.workload.input                  ${hibench.hdfs.data.dir}/CP2M5K
    3. Go to the HiBench-HiBench-7.0/conf directory and modify the /HiBench-HiBench-7.0/conf/hibench.conf configuration file.
      cd /HiBench-HiBench-7.0/conf
      vim hibench.conf

      Modify the configuration as follows:

      hibench.scale.profile                 bigdata
      hibench.default.map.parallelism       500
      hibench.default.shuffle.parallelism   600

  2. Generate data.
    1. Create a directory for storing the generated data in HDFS.
      hdfs dfs -mkdir -p /tmp/ml/dataset/
    2. Go to the path where the execution script resides.
      cd /HiBench-HiBench-7.0/bin/workloads/ml/svd/prepare/
    3. Run the script to generate the CP2M5K dataset.
      sh prepare.sh

      If a permission error is reported during generation, change the permissions on the corresponding local and HDFS directories (as the root user and via HDFS commands, respectively).
  3. Create a folder in HDFS.
    hadoop fs -mkdir -p /tmp/ml/dataset
  4. Start spark-shell.
    spark-shell
  5. Run the following command (do not omit the colon):
    :paste
  6. Execute the following code to process the dataset:
    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.mllib.linalg.DenseVector

    val dataPath = "/HiBench/CP2M5K"
    val outputPath = "/tmp/ml/dataset/CP2M5K"

    // Read the object file of DenseVectors and write each vector out as a
    // comma-separated text line with values formatted to two decimal places.
    spark
      .sparkContext
      .objectFile[DenseVector](dataPath)
      .map(row => Vectors.dense(row.values).toArray.map(u => f"$u%.2f").mkString(","))
      .saveAsTextFile(outputPath)
  7. Check the HDFS directory to view the result.
    hadoop fs -ls /tmp/ml/dataset/CP2M5K
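Note that the two SVD datasets have the same total size and differ only in shape: 10 million examples x 1,000 features and 2 million examples x 5,000 features both contain 10 billion values. A quick arithmetic check:

```shell
# Total matrix entries in each SVD dataset: both come to 10 billion.
echo $((10000000 * 1000))   # CP10M1K
echo $((2000000 * 5000))    # CP2M5K
```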

Generating the ALS Dataset

  1. Set the HiBench configuration file.
    1. Go to the HiBench-HiBench-7.0/conf directory.
      cd /HiBench-HiBench-7.0/conf
    2. Modify the /HiBench-HiBench-7.0/conf/workloads/ml/als.conf configuration file.
      vim /HiBench-HiBench-7.0/conf/workloads/ml/als.conf

    3. Go to the HiBench-HiBench-7.0/conf directory and modify the /HiBench-HiBench-7.0/conf/hibench.conf configuration file.
      cd /HiBench-HiBench-7.0/conf
      vim hibench.conf

  2. Generate data.
    1. Create a directory for storing the generated data in HDFS.
      hdfs dfs -mkdir -p /tmp/ml/dataset/ALS
    2. Go to the path where the execution script resides.
      cd /HiBench-HiBench-7.0/bin/workloads/ml/als/prepare/
    3. Run the script to generate the ALS dataset.
      sh prepare.sh
    4. View the result.
      hadoop fs -ls /tmp/ml/dataset/ALS

      If a permission error is reported during generation, change the permissions on the corresponding local and HDFS directories (as the root user and via HDFS commands, respectively).

Generating the D200M100 Dataset

  1. Set the HiBench configuration file.
    1. Go to the HiBench-HiBench-7.0/conf directory.
      cd /HiBench-HiBench-7.0/conf
    2. Modify the /HiBench-HiBench-7.0/conf/workloads/ml/kmeans.conf configuration file.
      vim /HiBench-HiBench-7.0/conf/workloads/ml/kmeans.conf

      Modify the configuration as follows:

      hibench.kmeans.gigantic.num_of_clusters          5
      hibench.kmeans.gigantic.dimensions               100
      hibench.kmeans.gigantic.num_of_samples           200000000
      hibench.kmeans.gigantic.samples_per_inputfile    40000000
      hibench.kmeans.gigantic.max_iteration            5
      hibench.kmeans.gigantic.k                        10
      hibench.kmeans.gigantic.convergedist             0.5

      hibench.workload.input                           hdfs://server1:8020/tmp/ml/dataset/kmeans_200m100_tmp

    3. Go to the HiBench-HiBench-7.0/conf directory and modify the /HiBench-HiBench-7.0/conf/hibench.conf configuration file.
      cd /HiBench-HiBench-7.0/conf
      vim hibench.conf

  2. Generate data.
    1. Create a directory for storing the generated data in HDFS.
      hdfs dfs -mkdir -p /tmp/ml/dataset/kmeans_200m100_tmp
    2. Go to the path where the execution script resides.
      cd /HiBench-HiBench-7.0/bin/workloads/ml/kmeans/prepare/
    3. Run the script to generate the D200M100 dataset.
      sh prepare.sh
  3. View the result.
    hdfs dfs -ls /tmp/ml/dataset/kmeans_200m100_tmp

    If a permission error is reported during generation, change the permissions on the corresponding local and HDFS directories (as the root user and via HDFS commands, respectively).

  4. Create a directory for storing the generated dataset in HDFS.
    hdfs dfs -mkdir -p /tmp/ml/dataset/kmeans_200m100
  5. Move the dataset to a specified path.
    hdfs dfs -mv /tmp/ml/dataset/kmeans_200m100_tmp/samples/* /tmp/ml/dataset/kmeans_200m100/
  6. View the result.
    hdfs dfs -ls /tmp/ml/dataset/kmeans_200m100/

  7. Delete redundant directories.
    hdfs dfs -rm -r /tmp/ml/dataset/kmeans_200m100_tmp
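
As a sanity check on the kmeans settings above, num_of_samples divided by samples_per_inputfile gives the number of input files the generator should produce:

```shell
# 200,000,000 samples at 40,000,000 samples per input file.
echo $((200000000 / 40000000))   # expected number of generated input files
```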

Generating the D10M4096 Dataset

  1. Set the HiBench configuration file.
    1. Go to the HiBench-HiBench-7.0/conf directory.
      cd /HiBench-HiBench-7.0/conf
    2. Modify the /HiBench-HiBench-7.0/conf/workloads/ml/lr.conf configuration file.
      vim /HiBench-HiBench-7.0/conf/workloads/ml/lr.conf

      Modify the configuration as follows. (Change the number of data samples to 10 million and the number of data features to 4096 to generate a large-scale dataset of about 41 billion feature values in total.)

      hibench.lr.bigdata.examples  10000000
      hibench.lr.bigdata.features  4096

    3. Go to the HiBench-HiBench-7.0/conf directory and modify the /HiBench-HiBench-7.0/conf/hibench.conf configuration file.
      cd /HiBench-HiBench-7.0/conf
      vim hibench.conf

      Modify the configuration as follows:

      hibench.scale.profile                 bigdata
      hibench.default.map.parallelism       300
      hibench.default.shuffle.parallelism   300

  2. Generate data.
    1. Create a directory for storing the generated data in HDFS.
      hdfs dfs -mkdir -p /tmp/ml/dataset/
    2. Go to the path where the execution script resides.
      cd /HiBench-HiBench-7.0/bin/workloads/ml/lr/prepare/
    3. Run the script to generate the D10M4096 dataset.
      sh prepare.sh
  3. View the result.
    hdfs dfs -ls /HiBench/HiBench/LR/Input

    If a permission error is reported during generation, change the permissions on the corresponding local and HDFS directories (as the root user and via HDFS commands, respectively).

  4. Start spark-shell.
    spark-shell
  5. Run the following command (do not omit the colon):
    :paste
  6. Execute the following code to process the dataset:
    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.regression.LabeledPoint
    val data: RDD[LabeledPoint] = sc.objectFile("/HiBench/HiBench/LR/Input/10m4096")
    val i = data.map{t=>t.label.toString+","+t.features.toArray.mkString(" ")}
    val splits = i.randomSplit(Array(0.6, 0.4), seed = 11L)
    splits(0).saveAsTextFile("/HiBench/HiBench/LR/Output/10m4096_train")
    splits(1).saveAsTextFile("/HiBench/HiBench/LR/Output/10m4096_test")
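
The lr.conf settings above imply a sizeable dataset: 10 million examples x 4096 features is about 41 billion values, or roughly 305 GiB if stored as 8-byte doubles (the 8-byte encoding is an assumption, not stated in this document):

```shell
# 10,000,000 examples x 4096 features x 8 bytes per double.
echo $((10000000 * 4096 * 8))   # bytes of raw feature data (~305 GiB)
```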

Generating the HiBench_10M_200M Dataset

  1. Set the HiBench configuration file.
    1. Go to the HiBench-HiBench-7.0/conf directory.
      cd /HiBench-HiBench-7.0/conf
    2. Modify the /HiBench-HiBench-7.0/conf/workloads/ml/lda.conf file as follows:
      hibench.lda.bigdata.num_of_documents              10000000
      hibench.lda.bigdata.num_of_vocabulary             200007152
      hibench.lda.bigdata.num_of_topics                 100
      hibench.lda.bigdata.doc_len_min                   500
      hibench.lda.bigdata.doc_len_max                   10000
      hibench.lda.bigdata.maxresultsize                 "6g"
      hibench.lda.num_of_documents                      ${hibench.lda.${hibench.scale.profile}.num_of_documents}
      hibench.lda.num_of_vocabulary                     ${hibench.lda.${hibench.scale.profile}.num_of_vocabulary}
      hibench.lda.num_of_topics                         ${hibench.lda.${hibench.scale.profile}.num_of_topics}
      hibench.lda.doc_len_min                           ${hibench.lda.${hibench.scale.profile}.doc_len_min}
      hibench.lda.doc_len_max                           ${hibench.lda.${hibench.scale.profile}.doc_len_max}
      hibench.lda.maxresultsize                         ${hibench.lda.${hibench.scale.profile}.maxresultsize}
      hibench.lda.partitions                            ${hibench.default.map.parallelism}
      hibench.lda.optimizer                             "online"
      hibench.lda.num_iterations                        10
  2. Generate data.
    1. Create a directory for storing the generated data in HDFS.
      hdfs dfs -mkdir -p /tmp/ml/dataset/
    2. Go to the path where the execution script resides.
      cd /HiBench-HiBench-7.0/bin/workloads/ml/lda/prepare/
    3. Run the script to generate the HiBench_10M_200M dataset.
      sh prepare.sh
  3. Start spark-shell.
    spark-shell
  4. Run the following command (do not omit the colon):
    :paste
  5. Run the following code to convert the generated data into the ORC format. (Define dataPath as the HDFS path of the generated LDA data and outputPath as the target ORC directory before running.)
    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.linalg.{Vector => OldVector, Vectors => OldVectors}
    import org.apache.spark.ml.linalg.{Vector, Vectors}

    case class DocSchema(id: Long, tf: Vector)

    // Load the (id, term-frequency vector) pairs, convert the old mllib
    // vectors to ml vectors, and write the result as ORC in 200 partitions.
    val data: RDD[(Long, OldVector)] = sc.objectFile(dataPath)
    val df = spark.createDataFrame(data.map { doc => DocSchema(doc._1, doc._2.asML) })
    df.repartition(200).write.mode("overwrite").format("orc").save(outputPath)
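
The hibench.lda.* entries in step 1 use nested substitution: with hibench.scale.profile set to bigdata, a value such as ${hibench.lda.${hibench.scale.profile}.num_of_documents} resolves to hibench.lda.bigdata.num_of_documents. The same two-step lookup can be sketched with shell variables (the names here are illustrative only, not part of HiBench):

```shell
# Emulate HiBench's nested property lookup: the profile name is substituted
# into the key first, then the resulting key is resolved to its value.
profile="bigdata"
lda_bigdata_num_of_documents=10000000
key="lda_${profile}_num_of_documents"   # inner substitution
eval "echo \$$key"                      # outer lookup
```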

Generating the HibenchRating3wx3w Dataset

  1. Modify the numPartitions parameter in the Hibench/sparkbench/ml/src/main/scala/com/intel/sparkbench/ml/RatingDataGenerator.scala file: comment out lines 36 and 37, and add the following as line 38.
    val numPartitions = parallel

  2. Generate data.
    1. Compile the sparkbench module.
      mvn package
    2. Save the compiled sparkbench-common-8.0-SNAPSHOT.jar and sparkbench-ml-8.0-SNAPSHOT.jar files in the same folder and call RatingDataGenerator to generate data.
      spark-submit \
      --class com.intel.hibench.sparkbench.ml.RatingDataGenerator \
      --jars sparkbench-common-8.0-SNAPSHOT.jar \
      --conf "spark.executor.instances=71" \
      --conf "spark.executor.cores=4" \
      --conf "spark.executor.memory=12g" \
      --conf "spark.executor.memoryOverhead=2g" \
      --conf "spark.default.parallelism=284" \
      --master yarn \
      --deploy-mode client \
      --driver-cores 36 \
      --driver-memory 50g \
      ./sparkbench-ml-8.0-SNAPSHOT.jar \
      /tmp/hibench/HibenchRating3wx3w 24000 6000 900000 false

      Parameters:

      • /tmp/hibench/HibenchRating3wx3w: location where the generated data is stored.
      • 24000: number of users.
      • 6000: number of products.
      • 900000: number of ratings.
      • false: Implicit feedback data is not generated.
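
As a quick check of the generator arguments above: 900,000 ratings over a 24,000-user x 6,000-product matrix fill only a small fraction of the rating matrix:

```shell
# Rating-matrix density: ratings / (users * products).
awk 'BEGIN { printf "%.5f\n", 900000 / (24000 * 6000) }'
```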