
Public Datasets

The machine learning test cases in this document use public datasets. Download the house, HIGGS, NYTimes, Kosarak, DEEP1B, MNIST8M, Epsilon, and MESH_DEFORM datasets from their official websites. In the following procedures, each dataset is downloaded, decompressed, and uploaded to HDFS on the server1 node.

Downloading the house Dataset

  1. Create a /test/dataset/ml directory and go to the directory.
    mkdir -p /test/dataset/ml
    cd /test/dataset/ml
    
  2. Download the house dataset from the official website. (Connection to Google is required.)

  3. Save the dataset downloaded in step 2 to the /test/dataset/ml directory.
  4. Create the /tmp/dataset/ml and /tmp/ml/dataset folders in HDFS.
    hadoop fs -mkdir -p /tmp/dataset/ml
    hadoop fs -mkdir -p /tmp/ml/dataset
    
  5. Upload the dataset to /tmp/dataset/ml.
    hadoop fs -put /test/dataset/ml/house.ds /tmp/dataset/ml
    
  6. Start spark-shell.
    spark-shell
    
  7. Run the following command:
    :paste
    
  8. Execute the following code to process the dataset:
    val file = sc.textFile("/tmp/dataset/ml/house.ds")
    file.take(10).foreach(println(_))
    file.count
    val data = file.map(x => x.split(" ")).filter(_.length == 8).map(x => x.slice(1, 8).mkString(" "))
    data.count
    data.take(10).foreach(println(_))
    data.repartition(1).saveAsTextFile("/tmp/ml/dataset/house")
    
  9. Press Enter and press Ctrl+D.
  10. Check that the processed dataset exists in the HDFS directory.
    hadoop fs -ls /tmp/ml/dataset/house
    

  11. Delete unnecessary dataset directories from HDFS.
    hadoop fs -rm -r /tmp/dataset/ml
    
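For a quick local sanity check, the per-line processing in step 8 (keep only rows with exactly 8 space-separated fields, then drop the leading field) can be mimicked with awk. The sample.txt file and its contents below are made up for illustration, not part of the real house dataset:

```shell
# Fabricated sample: two valid 8-field rows and one malformed row
printf 'id1 1 2 3 4 5 6 7\nbad row\nid2 9 8 7 6 5 4 3\n' > sample.txt
# Keep rows with exactly 8 fields, blank out field 1, trim the leading space
awk 'NF == 8 { $1 = ""; sub(/^ /, ""); print }' sample.txt
```

The two surviving lines contain only the seven trailing fields, matching what the Scala filter/slice chain produces.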

Downloading the HIGGS Dataset

  1. Create a /test/dataset/ml/higgs directory and go to the directory.
    mkdir -p /test/dataset/ml/higgs
    cd /test/dataset/ml/higgs
    
  2. Download the HIGGS dataset from the official website.
    wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/HIGGS.bz2
    
  3. Decompress the dataset to the current directory.
    bzip2 -d HIGGS.bz2
    
  4. Create a /tmp/ml/dataset/higgs folder in HDFS.
    hadoop fs -mkdir -p /tmp/ml/dataset/higgs
    
  5. Upload the dataset to HDFS.
    hadoop fs -put /test/dataset/ml/higgs/HIGGS /tmp/ml/dataset/higgs
    
  6. Start spark-shell.
    spark-shell
    
  7. Run the following command:
    :paste
    
  8. Run the following code to split the dataset into a training dataset and a test dataset:
    val reader = spark.read.format("libsvm")
    reader.option("numFeatures", 28)
    val dataPath = "/tmp/ml/dataset/higgs"
    val data = reader.load(dataPath)
    val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3), 2020)
    val trainOutputPath = s"${dataPath}_train"
    val testOutputPath = s"${dataPath}_test"
    trainingData.write.format("libsvm").save(trainOutputPath)
    testData.write.format("libsvm").save(testOutputPath)
    
  9. Press Enter and press Ctrl+D.
  10. Delete unnecessary dataset directories from HDFS.
    hadoop fs -rm -r /tmp/ml/dataset/higgs
    
  11. Check that the training dataset and test dataset exist in the HDFS directory.
    hadoop fs -ls /tmp/ml/dataset
    

Downloading the NYTimes Dataset

  1. Create a /test/dataset/ml/nytimes directory and go to the directory.
    mkdir -p /test/dataset/ml/nytimes
    cd /test/dataset/ml/nytimes
    
  2. Download the NYTimes dataset from the official website.
    wget https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/docword.nytimes.txt.gz
    
  3. Decompress the dataset to the current directory.
    gzip -d docword.nytimes.txt.gz
    
  4. Create a dataset_process.py file. (Make sure correct indentation is applied in the Python file.)
    1. Create a file.
      vi dataset_process.py
      
    2. Press i to enter the insert mode and add the following content to the file:
      import sys

      if __name__ == "__main__":
          if len(sys.argv) <= 1:
              print("Please input dataset")
              exit()
          filename = sys.argv[1]
          print("Reading data")
          processed_data = {}
          with open(filename, 'r') as fp:
              data = fp.readlines()
          print("Pre-processing data")
          for line in data[3:]:
              line_split = line.strip().split()
              if len(line_split) < 3:
                  continue
              doc_id = int(line_split[0])
              vocab_id = line_split[1]
              term_num = line_split[2]
              if doc_id not in processed_data:
                  processed_data[doc_id] = str(doc_id)
              processed_data[doc_id] += (" %s:%s" % (vocab_id, term_num))
          print("Post-processing data")
          doc_ids = list(processed_data.keys())
          doc_ids.sort()
          data = []
          for doc_id in doc_ids:
              data.append(processed_data[doc_id] + "\n")
          print("Writing data")
          with open(filename + ".libsvm", 'w') as fp:
              fp.writelines(data)
      
    3. Press Esc, type :wq!, and press Enter to save the file and exit.
  5. Use dataset_process.py to convert the dataset to the LibSVM format.
    python3 dataset_process.py docword.nytimes.txt
    
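The conversion performed by dataset_process.py (grouping "docID vocabID count" triples into one "docID vocabID:count ..." line per document, after skipping the three header lines) can be spot-checked on a toy input. The toy_docword.txt file and its contents below are fabricated for illustration:

```shell
# Toy docword file: 3 header lines, then "docID vocabID count" triples
printf '2\n20\n3\n1 7 2\n1 12 1\n2 7 4\n' > toy_docword.txt
# Group terms by document, mirroring what dataset_process.py writes
awk 'NR > 3 && NF >= 3 { out[$1] = out[$1] " " $2 ":" $3 }
     END { for (d in out) print d out[d] }' toy_docword.txt | sort -n
# prints:
# 1 7:2 12:1
# 2 7:4
```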
  6. Rename the docword.nytimes.txt.libsvm file generated in step 5 as docword.nytimes.txt.libsvm.raw.
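No command is given for this rename; on Linux it is a plain mv:

```shell
mv docword.nytimes.txt.libsvm docword.nytimes.txt.libsvm.raw
```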
  7. Create a reorder.py file. (Make sure correct indentation is applied in the Python file.)
    1. Create a file.
      vi reorder.py
      
    2. Press i to enter the insert mode and add the following content to the file:
      filename = "docword.nytimes.txt.libsvm.raw"
      new_filename = "docword.nytimes.txt.libsvm"
      with open(filename, 'r') as fp:
          filedata = fp.readlines()
      print("Data length: %d" % len(filedata))
      count = 0
      data = []
      for line in filedata:
          line_split = line.strip().split()
          doc_index = int(line_split[0])
          doc_terms = {}
          for term in line_split[1:]:
              term_split = term.strip().split(":")
              assert int(term_split[0]) not in doc_terms
              doc_terms[int(term_split[0])] = int(term_split[1])
          data.append([doc_index, doc_terms])
          count += 1
          if count % 100000 == 0:
              print("Processed %d00K" % int(count / 100000))
      count = 0
      new_filedata = []
      for doc in data:
          doc_string = str(doc[0])
          term_indices = list(doc[1].keys())
          term_indices.sort()
          for term_index in term_indices:
              doc_string += (" " + str(term_index) + ":" + str(doc[1][term_index]))
          doc_string += "\n"
          new_filedata.append(doc_string)
          count += 1
          if count % 100000 == 0:
              print("Generated %d00K" % int(count / 100000))
      with open(new_filename, 'w') as fp:
          fp.writelines(new_filedata)
      
    3. Press Esc, type :wq!, and press Enter to save the file and exit.
  8. Use reorder.py to reorder the dataset renamed in step 6.
    python3 reorder.py
    
  9. Create a /tmp/ml/dataset/nytimes folder in HDFS.
    hadoop fs -mkdir -p /tmp/ml/dataset/nytimes/
    
  10. Upload the dataset to HDFS.
    hadoop fs -put /test/dataset/ml/nytimes/docword.nytimes.txt.libsvm /tmp/ml/dataset/nytimes/
    

Downloading the Kosarak Dataset

  1. Create a /test/dataset/ml/Kosarak directory and go to the directory.
    mkdir -p /test/dataset/ml/Kosarak
    cd /test/dataset/ml/Kosarak
    
  2. Download the Kosarak dataset from the official website.
    wget http://www.philippe-fournier-viger.com/spmf/datasets/kosarak_sequences.txt
    
  3. Create a /tmp/ml/dataset/Kosarak folder in HDFS.
    hadoop fs -mkdir -p /tmp/ml/dataset/Kosarak/
    
  4. Upload the dataset to HDFS.
    hadoop fs -put /test/dataset/ml/Kosarak/kosarak_sequences.txt /tmp/ml/dataset/Kosarak/
    

Downloading the DEEP1B Dataset

  1. Create a /test/dataset/ml/DEEP1B directory and go to the directory.
    mkdir -p /test/dataset/ml/DEEP1B
    cd /test/dataset/ml/DEEP1B
    
  2. Download the DEEP1B dataset from the official website.
    wget http://ann-benchmarks.com/deep-image-96-angular.hdf5
    
  3. Create a processHDF5.py file. (Make sure correct indentation is applied in the Python file.)
    1. Create a file.
      vi processHDF5.py
      
    2. Press i to enter the insert mode and add the following content to the file:
      import os
      import h5py

      # downloaded hdf5 file
      inputFile = h5py.File('deep-image-96-angular.hdf5', 'r')
      # directory name to store output files
      outputDir = "deep1b"
      # the number of samples in each output file
      samplesPerFile = 5000
      sampleCnt = 0
      fileCnt = 0
      writer = open(os.path.join(outputDir, 'part-{}'.format(fileCnt)), 'w')
      data = inputFile['train']
      for feature in data:
          writer.write(','.join([str(d) for d in feature]) + "\n")
          sampleCnt += 1
          if sampleCnt == samplesPerFile:
              writer.close()
              fileCnt += 1
              sampleCnt = 0
              writer = open(os.path.join(outputDir, 'part-{}'.format(fileCnt)), 'w')
      data = inputFile['test']
      for feature in data:
          writer.write(','.join([str(d) for d in feature]) + "\n")
          sampleCnt += 1
          if sampleCnt == samplesPerFile:
              writer.close()
              fileCnt += 1
              sampleCnt = 0
              writer = open(os.path.join(outputDir, 'part-{}'.format(fileCnt)), 'w')
      writer.close()
      
    3. Press Esc, type :wq!, and press Enter to save the file and exit.
  4. Convert the HDF5 file into a text file. Each sample occupies a line, and features are separated by commas (,).
    mkdir deep1b
    python3 processHDF5.py
    
  5. If an error indicating that the h5py module cannot be found is displayed, run python3 -m pip install h5py and run the script again.

  6. Create a /tmp/ml/dataset/DEEP1B folder in HDFS.
    hadoop fs -mkdir -p /tmp/ml/dataset/DEEP1B
    
  7. Upload the dataset to HDFS.
    hadoop fs -put /test/dataset/ml/DEEP1B/deep1b/* /tmp/ml/dataset/DEEP1B/
    

Downloading the MNIST8M Dataset

  1. Create a /test/dataset/ml/mnist8m directory and go to the directory.
    mkdir -p /test/dataset/ml/mnist8m
    cd /test/dataset/ml/mnist8m
    
  2. Download the MNIST8M dataset from the official website.
    wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist8m.bz2
    
  3. Decompress the dataset to the current directory.
    bzip2 -d mnist8m.bz2
    
  4. Create a /tmp/ml/dataset/mnist8m folder in HDFS.
    hadoop fs -mkdir -p /tmp/ml/dataset/mnist8m
    
  5. Upload the dataset to HDFS.
    hadoop fs -put /test/dataset/ml/mnist8m/mnist8m /tmp/ml/dataset/mnist8m
    
  6. Start spark-shell.
    spark-shell
    
  7. Run the following command:
    :paste
    
  8. Run the following code to split the dataset into a training dataset and a test dataset:
    val reader = spark.read.format("libsvm")
    reader.option("numFeatures", 784)
    val dataPath = "/tmp/ml/dataset/mnist8m"
    val data = reader.load(dataPath)
    val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3), 2020)
    val trainOutputPath = s"${dataPath}_train"
    val testOutputPath = s"${dataPath}_test"
    trainingData.write.format("libsvm").save(trainOutputPath)
    testData.write.format("libsvm").save(testOutputPath)
    
  9. Press Enter and press Ctrl+D.
  10. Delete unnecessary dataset directories from HDFS.
    hadoop fs -rm -r /tmp/ml/dataset/mnist8m
    
  11. Check that the training dataset and test dataset exist in the HDFS directory.
    hadoop fs -ls /tmp/ml/dataset
    

Downloading the Epsilon Dataset

  1. Create a /test/dataset/ml/epsilon directory and go to the directory.
    mkdir -p /test/dataset/ml/epsilon
    cd /test/dataset/ml/epsilon
    
  2. Download the Epsilon training dataset and test dataset from the official website.
    wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/epsilon_normalized.bz2
    wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/epsilon_normalized.t.bz2
    
  3. Decompress the training dataset and test dataset to the current directory.
    bzip2 -d epsilon_normalized.bz2
    bzip2 -d epsilon_normalized.t.bz2
    

  4. Create /tmp/ml/dataset/epsilon_train and /tmp/ml/dataset/epsilon_test folders in HDFS.
    hadoop fs -mkdir -p /tmp/ml/dataset/epsilon_train
    hadoop fs -mkdir -p /tmp/ml/dataset/epsilon_test
    
  5. Upload the training dataset and test dataset to HDFS.
    hadoop fs -put /test/dataset/ml/epsilon/epsilon_normalized /tmp/ml/dataset/epsilon_train
    hadoop fs -put /test/dataset/ml/epsilon/epsilon_normalized.t /tmp/ml/dataset/epsilon_test
    

Downloading the MESH_DEFORM Dataset

  1. Create a /test/dataset/ml/mesh_deform directory and go to the directory.
    mkdir -p /test/dataset/ml/mesh_deform
    cd /test/dataset/ml/mesh_deform
    
  2. Download the MESH_DEFORM dataset from the official website.
    wget https://suitesparse-collection-website.herokuapp.com/MM/Yoshiyasu/mesh_deform.tar.gz
    
  3. Decompress the dataset to the current directory.
    tar zxvf mesh_deform.tar.gz
    
  4. Open the extracted mesh_deform.mtx file and delete lines 1 to 25. Lines 1 to 24 are information lines, and line 25 indicates the number of rows, columns, and non-zero elements in the matrix; the data starts from line 26.
    vi mesh_deform.mtx
    
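If you prefer not to delete the header lines interactively in vi, the same edit can be done non-interactively; the sketch below assumes GNU sed (for the -i in-place flag) and that mesh_deform.mtx is in the current directory:

```shell
# Drop lines 1-25 of the Matrix Market file in place (GNU sed)
sed -i '1,25d' mesh_deform.mtx
```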
  5. Create a /tmp/ml/dataset/MESH_DEFORM folder in HDFS.
    hadoop fs -mkdir -p /tmp/ml/dataset/MESH_DEFORM
    
  6. Upload the dataset to HDFS.
    hadoop fs -put mesh_deform.mtx /tmp/ml/dataset/MESH_DEFORM/
    
  7. Check that the dataset exists in the HDFS directory.
    hadoop fs -ls /tmp/ml/dataset/MESH_DEFORM