Public Datasets
The machine learning test cases in this document use public datasets. Download the house, HIGGS, NYTimes, Kosarak, DEEP1B, MNIST8M, Epsilon, and MESH_DEFORM datasets from their official websites. All of the following datasets are downloaded, decompressed, and uploaded on the server1 node.
Downloading the house Dataset
- Create a /test/dataset/ml directory and go to the directory.
mkdir -p /test/dataset/ml
cd /test/dataset/ml
- Download the house dataset from its download page. Make sure that your network can access Google.

- Save the dataset downloaded in step 2 to the /test/dataset/ml directory.
- Create folders in HDFS.
hadoop fs -mkdir -p /tmp/dataset/ml
hadoop fs -mkdir -p /tmp/ml/dataset
- Upload the dataset to /tmp/dataset/ml.
hadoop fs -put /test/dataset/ml/house.ds /tmp/dataset/ml
- Start spark-shell.
spark-shell
- Run the following command (do not omit the colon):
:paste
- Execute the following code to process the dataset:
val file = sc.textFile("/tmp/dataset/ml/house.ds")
file.take(10).foreach(println(_))
file.count
val data = file.map(x => x.split(" ")).filter(_.length == 8).map(x => x.slice(1, 8).mkString(" "))
data.count
data.take(10).foreach(println(_))
data.repartition(1).saveAsTextFile("/tmp/ml/dataset/house")
- Press Enter and press Ctrl+D.
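The Scala snippet above keeps only rows with exactly 8 space-separated fields and drops the first field. The same transformation can be sketched in plain Python (toy input lines made up for illustration; no Spark required):

```python
# Toy stand-in for the house.ds preprocessing done in spark-shell above.
# Keep only rows with exactly 8 space-separated fields, then drop field 0.
raw_lines = [
    "1 4.5 3 2 1500 0.2 7 1995",   # 8 fields -> kept, first field removed
    "2 3.1 2 1 900",               # 5 fields -> filtered out
]

processed = [
    " ".join(fields[1:8])
    for fields in (line.split(" ") for line in raw_lines)
    if len(fields) == 8
]

print(processed)  # ['4.5 3 2 1500 0.2 7 1995']
```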
- Check that the training dataset and test dataset exist in the HDFS directory.
hadoop fs -ls /tmp/ml/dataset/house

- Delete unnecessary dataset directories from HDFS.
hadoop fs -rm -r /tmp/dataset/ml
Downloading the HIGGS Dataset
- Create a /test/dataset/ml/higgs directory and go to the directory.
mkdir -p /test/dataset/ml/higgs
cd /test/dataset/ml/higgs
- Download the HIGGS dataset from the official website.
wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/HIGGS.bz2
- Decompress the dataset to the current directory.
bzip2 -d HIGGS.bz2
- Create a /tmp/ml/dataset/higgs folder in HDFS.
hadoop fs -mkdir -p /tmp/ml/dataset/higgs
- Upload the dataset to HDFS.
hadoop fs -put /test/dataset/ml/higgs/HIGGS /tmp/ml/dataset/higgs
- Start spark-shell.
spark-shell
- Run the following command (do not omit the colon):
:paste
- Run the following code to split the dataset into a training dataset and a test dataset:
val reader = spark.read.format("libsvm")
reader.option("numFeatures", 28)
val dataPath = "/tmp/ml/dataset/higgs"
val data = reader.load(dataPath)
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3), 2020)
val trainOutputPath = s"${dataPath}_train"
val testOutputPath = s"${dataPath}_test"
trainingData.write.format("libsvm").save(trainOutputPath)
testData.write.format("libsvm").save(testOutputPath)
- Press Enter and press Ctrl+D.
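randomSplit(Array(0.7, 0.3), 2020) divides the rows randomly with a fixed seed so the split is reproducible across runs. A minimal Python sketch of the same idea (toy row IDs instead of HIGGS samples; the seed value is reused from the snippet above):

```python
import random

samples = list(range(100))        # stand-in for the dataset rows
rng = random.Random(2020)         # fixed seed -> reproducible split
shuffled = samples[:]
rng.shuffle(shuffled)

cut = int(len(shuffled) * 0.7)    # 70/30 split, mirroring Array(0.7, 0.3)
training, test = shuffled[:cut], shuffled[cut:]

print(len(training), len(test))   # 70 30
```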
- Delete unnecessary dataset directories from HDFS.
hadoop fs -rm -r /tmp/ml/dataset/higgs
- Check that the training dataset and test dataset exist in the HDFS directory.
hadoop fs -ls /tmp/ml/dataset

Downloading the NYTimes Dataset
- Create a /test/dataset/ml/nytimes directory and go to the directory.
mkdir -p /test/dataset/ml/nytimes
cd /test/dataset/ml/nytimes
- Download the NYTimes dataset from the official website.
wget https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/docword.nytimes.txt.gz
- Decompress the dataset to the current directory.
gzip -d docword.nytimes.txt.gz
- Create a dataset_process.py file. (Make sure correct indentation is applied in the Python file.)
vim dataset_process.py
The file content is as follows:
import sys

if __name__ == "__main__":
    if len(sys.argv) <= 1:
        print("Please input dataset")
        exit()
    filename = sys.argv[1]
    print("Reading data")
    processed_data = {}
    with open(filename, 'r') as fp:
        data = fp.readlines()
    print("Pre-processing data")
    for line in data[3:]:
        line_split = line.strip().split()
        if len(line_split) < 3:
            continue
        doc_id = int(line_split[0])
        vocab_id = line_split[1]
        term_num = line_split[2]
        if doc_id not in processed_data:
            processed_data[doc_id] = str(doc_id)
        processed_data[doc_id] += (" %s:%s" % (vocab_id, term_num))
    print("Post-processing data")
    doc_ids = list(processed_data.keys())
    doc_ids.sort()
    data = []
    for doc_id in doc_ids:
        data.append(processed_data[doc_id] + "\n")
    print("Writing data")
    with open(filename + ".libsvm", 'w') as fp:
        fp.writelines(data)
- Use dataset_process.py to convert the dataset to the LibSVM format.
python3 dataset_process.py docword.nytimes.txt
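dataset_process.py groups the docword triples (doc_id vocab_id count) into one LibSVM row per document, of the form `doc_id vocab_id:count …`. A toy illustration of the same mapping (made-up triples, no file I/O):

```python
# docword.nytimes.txt lines after the 3 header lines: "doc_id vocab_id count"
triples = [
    (1, "7", "2"),
    (1, "3", "1"),
    (2, "5", "4"),
]

rows = {}
for doc_id, vocab_id, count in triples:
    rows.setdefault(doc_id, str(doc_id))   # row starts with the doc ID
    rows[doc_id] += " %s:%s" % (vocab_id, count)

libsvm_lines = [rows[doc_id] for doc_id in sorted(rows)]
print(libsvm_lines)  # ['1 7:2 3:1', '2 5:4']
```

Note that the feature indices appear in input order, not sorted order, which is why the reorder.py step below is needed.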
- Rename the dataset docword.nytimes.txt.libsvm as docword.nytimes.txt.libsvm.raw.
mv docword.nytimes.txt.libsvm docword.nytimes.txt.libsvm.raw
- Create a reorder.py file. (Make sure correct indentation is applied in the Python file.)
vim reorder.py
The file content is as follows:
filename = "docword.nytimes.txt.libsvm.raw"
new_filename = "docword.nytimes.txt.libsvm"
with open(filename, 'r') as fp:
    filedata = fp.readlines()
print("Data length: %d" % len(filedata))
count = 0
data = []
for line in filedata:
    line_split = line.strip().split()
    doc_index = int(line_split[0])
    doc_terms = {}
    for term in line_split[1:]:
        term_split = term.strip().split(":")
        assert int(term_split[0]) not in doc_terms
        doc_terms[int(term_split[0])] = int(term_split[1])
    data.append([doc_index, doc_terms])
    count += 1
    if count % 100000 == 0:
        print("Processed %d00K" % int(count / 100000))
count = 0
new_filedata = []
for doc in data:
    doc_string = str(doc[0])
    term_indices = list(doc[1].keys())
    term_indices.sort()
    for term_index in term_indices:
        doc_string += (" " + str(term_index) + ":" + str(doc[1][term_index]))
    doc_string += "\n"
    new_filedata.append(doc_string)
    count += 1
    if count % 100000 == 0:
        print("Generated %d00K" % int(count / 100000))
with open(new_filename, 'w') as fp:
    fp.writelines(new_filedata)
- Use reorder.py to reorder the dataset renamed in step 6.
python3 reorder.py
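reorder.py rewrites each LibSVM row so that its feature indices appear in ascending order, which LibSVM readers generally require. The per-row transformation, sketched on a single made-up row (the helper name is illustrative, not part of reorder.py):

```python
def sort_libsvm_row(line):
    """Return one LibSVM row with its index:value pairs sorted by index."""
    parts = line.strip().split()
    label, terms = parts[0], parts[1:]
    pairs = sorted(
        (int(t.split(":")[0]), t.split(":")[1]) for t in terms
    )
    return " ".join([label] + ["%d:%s" % (i, v) for i, v in pairs])

print(sort_libsvm_row("1 7:2 3:1"))  # 1 3:1 7:2
```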
- Create a /tmp/ml/dataset/nytimes folder in HDFS.
hadoop fs -mkdir -p /tmp/ml/dataset/nytimes/
- Upload the dataset to HDFS.
hadoop fs -put /test/dataset/ml/nytimes/docword.nytimes.txt.libsvm /tmp/ml/dataset/nytimes/
Downloading the Kosarak Dataset
- Create a /test/dataset/ml/Kosarak directory and go to the directory.
mkdir -p /test/dataset/ml/Kosarak
cd /test/dataset/ml/Kosarak
- Download the Kosarak dataset from the official website.
wget http://www.philippe-fournier-viger.com/spmf/datasets/kosarak_sequences.txt
- Create a /tmp/ml/dataset/Kosarak folder in HDFS.
hadoop fs -mkdir -p /tmp/ml/dataset/Kosarak/
- Upload the dataset to HDFS.
hadoop fs -put /test/dataset/ml/Kosarak/kosarak_sequences.txt /tmp/ml/dataset/Kosarak/
Downloading the DEEP1B Dataset
- Create a /test/dataset/ml/DEEP1B directory and go to the directory.
mkdir -p /test/dataset/ml/DEEP1B
cd /test/dataset/ml/DEEP1B
- Download the DEEP1B dataset from the official website.
wget http://ann-benchmarks.com/deep-image-96-angular.hdf5
- Create a processHDF5.py file. (Make sure correct indentation is applied in the Python file.)
vim processHDF5.py
The file content is as follows:
import os
import h5py

# downloaded hdf5 file
inputFile = h5py.File('deep-image-96-angular.hdf5', 'r')
# directory name to store output files
outputDir = "deep1b"
# the number of samples in each output file
samplesPerFile = 5000

sampleCnt = 0
fileCnt = 0
writer = open(os.path.join(outputDir, 'part-{}'.format(fileCnt)), 'w')
data = inputFile['train']
for feature in data:
    writer.write(','.join([str(d) for d in feature]) + "\n")
    sampleCnt += 1
    if sampleCnt == samplesPerFile:
        writer.close()
        fileCnt += 1
        sampleCnt = 0
        writer = open(os.path.join(outputDir, 'part-{}'.format(fileCnt)), 'w')
data = inputFile['test']
for feature in data:
    writer.write(','.join([str(d) for d in feature]) + "\n")
    sampleCnt += 1
    if sampleCnt == samplesPerFile:
        writer.close()
        fileCnt += 1
        sampleCnt = 0
        writer = open(os.path.join(outputDir, 'part-{}'.format(fileCnt)), 'w')
writer.close()
- Convert the HDF5 file into a text file. Each sample occupies a line, and features are separated by commas (,).
mkdir deep1b
python3 processHDF5.py
- If an error indicating that the h5py module cannot be found is displayed, run the python3 -m pip install h5py command and retry.
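processHDF5.py streams the samples into part-0, part-1, … files of samplesPerFile rows each. The chunking logic alone can be sketched with toy in-memory data instead of an HDF5 file (values and chunk size are made up; lists stand in for the output files):

```python
samples_per_file = 2              # stands in for samplesPerFile = 5000
features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]

parts = [[]]                      # each inner list stands in for one part-N file
for feature in features:
    parts[-1].append(",".join(str(d) for d in feature))
    if len(parts[-1]) == samples_per_file:
        parts.append([])          # "close" the current file, open the next one

print(parts)  # [['0.1,0.2', '0.3,0.4'], ['0.5,0.6', '0.7,0.8'], ['0.9,1.0']]
```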
- Create a /tmp/ml/dataset/DEEP1B folder in HDFS.
hadoop fs -mkdir -p /tmp/ml/dataset/DEEP1B
- Upload the dataset to HDFS.
hadoop fs -put /test/dataset/ml/DEEP1B/deep1b/* /tmp/ml/dataset/DEEP1B/
Downloading the MNIST8M Dataset
- Create a /test/dataset/ml/mnist8m directory and go to the directory.
mkdir -p /test/dataset/ml/mnist8m
cd /test/dataset/ml/mnist8m
- Download the MNIST8M dataset from the official website.
wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist8m.bz2
- Decompress the training dataset and test dataset to the current directory.
bzip2 -d mnist8m.bz2
- Create a /tmp/ml/dataset/mnist8m folder in HDFS.
hadoop fs -mkdir -p /tmp/ml/dataset/mnist8m
- Upload the dataset to HDFS.
hadoop fs -put /test/dataset/ml/mnist8m/mnist8m /tmp/ml/dataset/mnist8m
- Start spark-shell.
spark-shell
- Run the following command (do not omit the colon):
:paste
- Run the following code to split the dataset into a training dataset and a test dataset:
val reader = spark.read.format("libsvm")
reader.option("numFeatures", 784)
val dataPath = "/tmp/ml/dataset/mnist8m"
val data = reader.load(dataPath)
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3), 2020)
val trainOutputPath = s"${dataPath}_train"
val testOutputPath = s"${dataPath}_test"
trainingData.write.format("libsvm").save(trainOutputPath)
testData.write.format("libsvm").save(testOutputPath)
- Press Enter and press Ctrl+D.
- Delete unnecessary dataset directories from HDFS.
hadoop fs -rm -r /tmp/ml/dataset/mnist8m
- Check that the training dataset and test dataset exist in the HDFS directory.
hadoop fs -ls /tmp/ml/dataset

Downloading the Epsilon Dataset
- Create a /test/dataset/ml/epsilon directory and go to the directory.
mkdir -p /test/dataset/ml/epsilon
cd /test/dataset/ml/epsilon
- Download the Epsilon training dataset and test dataset from the official website.
wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/epsilon_normalized.bz2
wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/epsilon_normalized.t.bz2
- Decompress the training dataset and test dataset to the current directory.
bzip2 -d epsilon_normalized.bz2
bzip2 -d epsilon_normalized.t.bz2

- Create /tmp/ml/dataset/epsilon_train and /tmp/ml/dataset/epsilon_test folders in HDFS.
hadoop fs -mkdir -p /tmp/ml/dataset/epsilon_train
hadoop fs -mkdir -p /tmp/ml/dataset/epsilon_test
- Upload the training dataset and test dataset to HDFS.
hadoop fs -put /test/dataset/ml/epsilon/epsilon_normalized /tmp/ml/dataset/epsilon_train
hadoop fs -put /test/dataset/ml/epsilon/epsilon_normalized.t /tmp/ml/dataset/epsilon_test
Downloading the MESH_DEFORM Dataset
- Create a /test/dataset/ml/mesh_deform directory and go to the directory.
mkdir -p /test/dataset/ml/mesh_deform
cd /test/dataset/ml/mesh_deform
- Download the MESH_DEFORM dataset from the official website.
wget https://suitesparse-collection-website.herokuapp.com/MM/Yoshiyasu/mesh_deform.tar.gz
- Decompress the dataset to the current directory.
tar zxvf mesh_deform.tar.gz
- Open the extracted mesh_deform.mtx file and delete lines 1 to 25. Lines 1 to 24 are information lines. Line 25 indicates the number of rows, columns, and non-zero elements in the matrix. The data starts from line 26.
vim mesh_deform.mtx
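Instead of deleting the header lines by hand in vim, the same cleanup can be scripted. The sketch below drops the `%`-prefixed information lines and the size line of a Matrix Market file (the toy content and counts shown are made up for illustration):

```python
mtx_lines = [
    "%%MatrixMarket matrix coordinate real symmetric",
    "% an information line",
    "234023 234023 853829",        # rows, columns, non-zero elements
    "1 1 0.5",                     # data starts here
    "2 1 -0.25",
]

# Drop comment lines, then drop the first remaining line (the size line).
data_lines = [l for l in mtx_lines if not l.startswith("%")][1:]
print(data_lines)  # ['1 1 0.5', '2 1 -0.25']
```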
- Create a /tmp/ml/dataset/MESH_DEFORM folder in HDFS.
hadoop fs -mkdir -p /tmp/ml/dataset/MESH_DEFORM
- Upload the dataset to HDFS.
hadoop fs -put mesh_deform.mtx /tmp/ml/dataset/MESH_DEFORM/
- Check that the dataset exists in the HDFS directory.
hadoop fs -ls /tmp/ml/dataset/MESH_DEFORM

Parent topic: Test Dataset