Public Datasets
The machine learning test cases in this document use public datasets. Download the house, HIGGS, NYTimes, Kosarak, DEEP1B, MNIST8M, Epsilon, and MESH_DEFORM datasets from their official websites. All of the following datasets are downloaded, decompressed, and uploaded on the server1 node.
Downloading the house Dataset
- Create a /test/dataset/ml directory and go to the directory.
mkdir -p /test/dataset/ml
cd /test/dataset/ml
- Download the house dataset from its download page. Make sure that your network can access Google.

- Save the dataset downloaded in step 2 to the /test/dataset/ml directory.
- Create folders in HDFS.
hadoop fs -mkdir -p /tmp/dataset/ml
hadoop fs -mkdir -p /tmp/ml/dataset
- Upload the dataset to /tmp/dataset/ml.
hadoop fs -put /test/dataset/ml/house.ds /tmp/dataset/ml
- Start spark-shell.
spark-shell
- Run the following command (do not omit the colon):
:paste
- Execute the following code to process the dataset:
val file = sc.textFile("/tmp/dataset/ml/house.ds")
file.take(10).foreach(println(_))
file.count
val data = file.map(x => x.split(" ")).filter(_.length == 8).map(x => x.slice(1, 8).mkString(" "))
data.count
data.take(10).foreach(println(_))
data.repartition(1).saveAsTextFile("/tmp/ml/dataset/house")
- Press Enter and press Ctrl+D.
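The Scala snippet above keeps only rows with exactly 8 space-separated fields and drops the first field. The same transformation can be sketched in plain Python (toy input lines made up for illustration; no Spark required):

```python
# Toy stand-in for the house.ds preprocessing done in spark-shell above.
# Keep only rows with exactly 8 space-separated fields, then drop field 0.
raw_lines = [
    "1 4.5 3 2 1500 0.2 7 1995",   # 8 fields -> kept, first field removed
    "2 3.1 2 1 900",               # 5 fields -> filtered out
]

processed = [
    " ".join(fields[1:8])
    for fields in (line.split(" ") for line in raw_lines)
    if len(fields) == 8
]

print(processed)  # ['4.5 3 2 1500 0.2 7 1995']
```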
- Check that the training dataset and test dataset exist in the HDFS directory.
hadoop fs -ls /tmp/ml/dataset/house

- Delete unnecessary dataset directories from HDFS.
hadoop fs -rm -r /tmp/dataset/ml
Downloading the HIGGS Dataset
- Create a /test/dataset/ml/higgs directory and go to the directory.
mkdir -p /test/dataset/ml/higgs
cd /test/dataset/ml/higgs
- Download the HIGGS dataset from the official website.
wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/HIGGS.bz2
- Decompress the dataset to the current directory.
bzip2 -d HIGGS.bz2
- Create a /tmp/ml/dataset/higgs folder in HDFS.
hadoop fs -mkdir -p /tmp/ml/dataset/higgs
- Upload the dataset to HDFS.
hadoop fs -put /test/dataset/ml/higgs/HIGGS /tmp/ml/dataset/higgs
- Start spark-shell.
spark-shell
- Run the following command (do not omit the colon):
:paste
- Run the following code to split the dataset into a training dataset and a test dataset:
val reader = spark.read.format("libsvm")
reader.option("numFeatures", 28)
val dataPath = "/tmp/ml/dataset/higgs"
val data = reader.load(dataPath)
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3), 2020)
val trainOutputPath = s"${dataPath}_train"
val testOutputPath = s"${dataPath}_test"
trainingData.write.format("libsvm").save(trainOutputPath)
testData.write.format("libsvm").save(testOutputPath)
- Press Enter and press Ctrl+D.
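randomSplit(Array(0.7, 0.3), 2020) divides the rows randomly with a fixed seed so the split is reproducible across runs. A minimal Python sketch of the same idea (toy row IDs instead of HIGGS samples; the seed value is reused from the snippet above):

```python
import random

samples = list(range(100))        # stand-in for the dataset rows
rng = random.Random(2020)         # fixed seed -> reproducible split
shuffled = samples[:]
rng.shuffle(shuffled)

cut = int(len(shuffled) * 0.7)    # 70/30 split, mirroring Array(0.7, 0.3)
training, test = shuffled[:cut], shuffled[cut:]

print(len(training), len(test))   # 70 30
```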
- Delete unnecessary dataset directories from HDFS.
hadoop fs -rm -r /tmp/ml/dataset/higgs
- Check that the training dataset and test dataset exist in the HDFS directory.
hadoop fs -ls /tmp/ml/dataset

Downloading the NYTimes Dataset
- Create a /test/dataset/ml/nytimes directory and go to the directory.
mkdir -p /test/dataset/ml/nytimes
cd /test/dataset/ml/nytimes
- Download the NYTimes dataset from the official website.
wget https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/docword.nytimes.txt.gz
- Decompress the dataset to the current directory.
gzip -d docword.nytimes.txt.gz
- Create a dataset_process.py file. (Make sure correct indentation is applied in the Python file.)
vim dataset_process.py
The file content is as follows:
import sys

if __name__ == "__main__":
    if len(sys.argv) <= 1:
        print("Please input dataset")
        exit()
    filename = sys.argv[1]
    print("Reading data")
    processed_data = {}
    with open(filename, 'r') as fp:
        data = fp.readlines()
    print("Pre-processing data")
    for line in data[3:]:
        line_split = line.strip().split()
        if len(line_split) < 3:
            continue
        doc_id = int(line_split[0])
        vocab_id = line_split[1]
        term_num = line_split[2]
        if doc_id not in processed_data:
            processed_data[doc_id] = str(doc_id)
        processed_data[doc_id] += (" %s:%s" % (vocab_id, term_num))
    print("Post-processing data")
    doc_ids = list(processed_data.keys())
    doc_ids.sort()
    data = []
    for doc_id in doc_ids:
        data.append(processed_data[doc_id] + "\n")
    print("Writing data")
    with open(filename + ".libsvm", 'w') as fp:
        fp.writelines(data)
- Use dataset_process.py to convert the dataset to the LibSVM format.
python3 dataset_process.py docword.nytimes.txt
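dataset_process.py groups the docword triples (doc_id vocab_id count) into one LibSVM row per document, of the form `doc_id vocab_id:count …`. A toy illustration of the same mapping (made-up triples, no file I/O):

```python
# docword.nytimes.txt lines after the 3 header lines: "doc_id vocab_id count"
triples = [
    (1, "7", "2"),
    (1, "3", "1"),
    (2, "5", "4"),
]

rows = {}
for doc_id, vocab_id, count in triples:
    rows.setdefault(doc_id, str(doc_id))   # row starts with the doc ID
    rows[doc_id] += " %s:%s" % (vocab_id, count)

libsvm_lines = [rows[doc_id] for doc_id in sorted(rows)]
print(libsvm_lines)  # ['1 7:2 3:1', '2 5:4']
```

Note that the feature indices appear in input order, not sorted order, which is why the reorder.py step below is needed.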
- Rename the dataset docword.nytimes.txt.libsvm as docword.nytimes.txt.libsvm.raw.
mv docword.nytimes.txt.libsvm docword.nytimes.txt.libsvm.raw
- Create a reorder.py file. (Make sure correct indentation is applied in the Python file.)
vim reorder.py
The file content is as follows:
filename = "docword.nytimes.txt.libsvm.raw"
new_filename = "docword.nytimes.txt.libsvm"
with open(filename, 'r') as fp:
    filedata = fp.readlines()
print("Data length: %d" % len(filedata))
count = 0
data = []
for line in filedata:
    line_split = line.strip().split()
    doc_index = int(line_split[0])
    doc_terms = {}
    for term in line_split[1:]:
        term_split = term.strip().split(":")
        assert int(term_split[0]) not in doc_terms
        doc_terms[int(term_split[0])] = int(term_split[1])
    data.append([doc_index, doc_terms])
    count += 1
    if count % 100000 == 0:
        print("Processed %d00K" % int(count / 100000))
count = 0
new_filedata = []
for doc in data:
    doc_string = str(doc[0])
    term_indices = list(doc[1].keys())
    term_indices.sort()
    for term_index in term_indices:
        doc_string += (" " + str(term_index) + ":" + str(doc[1][term_index]))
    doc_string += "\n"
    new_filedata.append(doc_string)
    count += 1
    if count % 100000 == 0:
        print("Generated %d00K" % int(count / 100000))
with open(new_filename, 'w') as fp:
    fp.writelines(new_filedata)
- Use reorder.py to reorder the dataset renamed in step 6.
python3 reorder.py
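reorder.py rewrites each LibSVM row so that its feature indices appear in ascending order, which LibSVM readers generally require. The per-row transformation, sketched on a single made-up row (the helper name is illustrative, not part of reorder.py):

```python
def sort_libsvm_row(line):
    """Return one LibSVM row with its index:value pairs sorted by index."""
    parts = line.strip().split()
    label, terms = parts[0], parts[1:]
    pairs = sorted(
        (int(t.split(":")[0]), t.split(":")[1]) for t in terms
    )
    return " ".join([label] + ["%d:%s" % (i, v) for i, v in pairs])

print(sort_libsvm_row("1 7:2 3:1"))  # 1 3:1 7:2
```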
- Create a /tmp/ml/dataset/nytimes folder in HDFS.
hadoop fs -mkdir -p /tmp/ml/dataset/nytimes/
- Upload the dataset to HDFS.
hadoop fs -put /test/dataset/ml/nytimes/docword.nytimes.txt.libsvm /tmp/ml/dataset/nytimes/
Downloading the Kosarak Dataset
- Create a /test/dataset/ml/Kosarak directory and go to the directory.
mkdir -p /test/dataset/ml/Kosarak
cd /test/dataset/ml/Kosarak
- Download the Kosarak dataset from the official website.
wget http://www.philippe-fournier-viger.com/spmf/datasets/kosarak_sequences.txt
- Create a /tmp/ml/dataset/Kosarak folder in HDFS.
hadoop fs -mkdir -p /tmp/ml/dataset/Kosarak/
- Upload the dataset to HDFS.
hadoop fs -put /test/dataset/ml/Kosarak/kosarak_sequences.txt /tmp/ml/dataset/Kosarak/
Downloading the DEEP1B Dataset
- Create a /test/dataset/ml/DEEP1B directory and go to the directory.
mkdir -p /test/dataset/ml/DEEP1B
cd /test/dataset/ml/DEEP1B
- Download the DEEP1B dataset from the official website.
wget http://ann-benchmarks.com/deep-image-96-angular.hdf5
- Create a processHDF5.py file. (Make sure correct indentation is applied in the Python file.)
vim processHDF5.py
The file content is as follows:
import os
import h5py

# downloaded hdf5 file
inputFile = h5py.File('deep-image-96-angular.hdf5', 'r')
# directory name to store output files
outputDir = "deep1b"
# the number of samples in each output file
samplesPerFile = 5000

sampleCnt = 0
fileCnt = 0
writer = open(os.path.join(outputDir, 'part-{}'.format(fileCnt)), 'w')
data = inputFile['train']
for feature in data:
    writer.write(','.join([str(d) for d in feature]) + "\n")
    sampleCnt += 1
    if sampleCnt == samplesPerFile:
        writer.close()
        fileCnt += 1
        sampleCnt = 0
        writer = open(os.path.join(outputDir, 'part-{}'.format(fileCnt)), 'w')
data = inputFile['test']
for feature in data:
    writer.write(','.join([str(d) for d in feature]) + "\n")
    sampleCnt += 1
    if sampleCnt == samplesPerFile:
        writer.close()
        fileCnt += 1
        sampleCnt = 0
        writer = open(os.path.join(outputDir, 'part-{}'.format(fileCnt)), 'w')
writer.close()
- Convert the HDF5 file into a text file. Each sample occupies a line, and features are separated by commas (,).
mkdir deep1b
python3 processHDF5.py
- If an error indicating that the h5py module cannot be found is displayed, run the python3 -m pip install h5py command and retry.
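processHDF5.py streams the samples into part-0, part-1, … files of samplesPerFile rows each. The chunking logic alone can be sketched with toy in-memory data instead of an HDF5 file (values and chunk size are made up; lists stand in for the output files):

```python
samples_per_file = 2              # stands in for samplesPerFile = 5000
features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]

parts = [[]]                      # each inner list stands in for one part-N file
for feature in features:
    parts[-1].append(",".join(str(d) for d in feature))
    if len(parts[-1]) == samples_per_file:
        parts.append([])          # "close" the current file, open the next one

print(parts)  # [['0.1,0.2', '0.3,0.4'], ['0.5,0.6', '0.7,0.8'], ['0.9,1.0']]
```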
- Create a /tmp/ml/dataset/DEEP1B folder in HDFS.
hadoop fs -mkdir -p /tmp/ml/dataset/DEEP1B
- Upload the dataset to HDFS.
hadoop fs -put /test/dataset/ml/DEEP1B/deep1b/* /tmp/ml/dataset/DEEP1B/
Downloading the MNIST8M Dataset
- Create a /test/dataset/ml/mnist8m directory and go to the directory.
mkdir -p /test/dataset/ml/mnist8m
cd /test/dataset/ml/mnist8m
- Download the MNIST8M dataset from the official website.
wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist8m.bz2
- Decompress the training dataset and test dataset to the current directory.
bzip2 -d mnist8m.bz2
- Create a /tmp/ml/dataset/mnist8m folder in HDFS.
hadoop fs -mkdir -p /tmp/ml/dataset/mnist8m
- Upload the dataset to HDFS.
hadoop fs -put /test/dataset/ml/mnist8m/mnist8m /tmp/ml/dataset/mnist8m
- Start spark-shell.
spark-shell
- Run the following command (do not omit the colon):
:paste
- Run the following code to split the dataset into a training dataset and a test dataset:
val reader = spark.read.format("libsvm")
reader.option("numFeatures", 784)
val dataPath = "/tmp/ml/dataset/mnist8m"
val data = reader.load(dataPath)
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3), 2020)
val trainOutputPath = s"${dataPath}_train"
val testOutputPath = s"${dataPath}_test"
trainingData.write.format("libsvm").save(trainOutputPath)
testData.write.format("libsvm").save(testOutputPath)
- Press Enter and press Ctrl+D.
- Delete unnecessary dataset directories from HDFS.
hadoop fs -rm -r /tmp/ml/dataset/mnist8m
- Check that the training dataset and test dataset exist in the HDFS directory.
hadoop fs -ls /tmp/ml/dataset

Downloading the Epsilon Dataset
- Create a /test/dataset/ml/epsilon directory and go to the directory.
mkdir -p /test/dataset/ml/epsilon
cd /test/dataset/ml/epsilon
- Download the Epsilon training dataset and test dataset from the official website.
wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/epsilon_normalized.bz2
wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/epsilon_normalized.t.bz2
- Decompress the training dataset and test dataset to the current directory.
bzip2 -d epsilon_normalized.bz2
bzip2 -d epsilon_normalized.t.bz2

- Create /tmp/ml/dataset/epsilon_train and /tmp/ml/dataset/epsilon_test folders in HDFS.
hadoop fs -mkdir -p /tmp/ml/dataset/epsilon_train
hadoop fs -mkdir -p /tmp/ml/dataset/epsilon_test
- Upload the training dataset and test dataset to HDFS.
hadoop fs -put /test/dataset/ml/epsilon/epsilon_normalized /tmp/ml/dataset/epsilon_train
hadoop fs -put /test/dataset/ml/epsilon/epsilon_normalized.t /tmp/ml/dataset/epsilon_test
Downloading the MESH_DEFORM Dataset
- Create a /test/dataset/ml/mesh_deform directory and go to the directory.
mkdir -p /test/dataset/ml/mesh_deform
cd /test/dataset/ml/mesh_deform
- Download the MESH_DEFORM dataset from the official website.
wget https://suitesparse-collection-website.herokuapp.com/MM/Yoshiyasu/mesh_deform.tar.gz
- Decompress the dataset to the current directory.
tar zxvf mesh_deform.tar.gz
- Open the extracted mesh_deform.mtx file and delete lines 1 to 25. Lines 1 to 24 are information lines. Line 25 indicates the number of rows, columns, and non-zero elements in the matrix. The data starts from line 26.
vim mesh_deform.mtx
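Instead of deleting the header lines by hand in vim, the same cleanup can be scripted. The sketch below drops the `%`-prefixed information lines and the size line of a Matrix Market file (the toy content and counts shown are made up for illustration):

```python
mtx_lines = [
    "%%MatrixMarket matrix coordinate real symmetric",
    "% an information line",
    "234023 234023 853829",        # rows, columns, non-zero elements
    "1 1 0.5",                     # data starts here
    "2 1 -0.25",
]

# Drop comment lines, then drop the first remaining line (the size line).
data_lines = [l for l in mtx_lines if not l.startswith("%")][1:]
print(data_lines)  # ['1 1 0.5', '2 1 -0.25']
```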
- Create a /tmp/ml/dataset/MESH_DEFORM folder in HDFS.
hadoop fs -mkdir -p /tmp/ml/dataset/MESH_DEFORM
- Upload the dataset to HDFS.
hadoop fs -put mesh_deform.mtx /tmp/ml/dataset/MESH_DEFORM/
- Check that the dataset exists in the HDFS directory.
hadoop fs -ls /tmp/ml/dataset/MESH_DEFORM

Parent topic: Test Dataset