
Public Datasets

The machine learning test cases in this document use public datasets. Download house, HIGGS, NYTimes, Kosarak, DEEP1B, MNIST8M, Epsilon, and MESH_DEFORM from their official websites. All of the following datasets are downloaded, decompressed, and uploaded on the server1 node.

Downloading the house Dataset

  1. Create a /test/dataset/ml directory and go to the directory.
    mkdir -p /test/dataset/ml
    cd /test/dataset/ml
  2. Download the house dataset here. Make sure that your network can access Google.

  3. Save the dataset downloaded in step 2 to the /test/dataset/ml directory.
  4. Create folders in HDFS for the raw dataset and the processed output.
    hadoop fs -mkdir -p /tmp/dataset/ml
    hadoop fs -mkdir -p /tmp/ml/dataset
  5. Upload the dataset to /tmp/dataset/ml.
    hadoop fs -put /test/dataset/ml/house.ds /tmp/dataset/ml
  6. Start spark-shell.
    spark-shell
  7. Run the following command (do not omit the colon):
    :paste
  8. Execute the following code to process the dataset:
    val file = sc.textFile("/tmp/dataset/ml/house.ds")
    file.take(10).foreach(println(_))
    file.count
    val data = file.map(x => x.split(" ")).filter(_.length == 8).map(x => x.slice(1, 8).mkString(" "))
    data.count
    data.take(10).foreach(println(_))
    data.repartition(1).saveAsTextFile("/tmp/ml/dataset/house")
  9. Press Enter and press Ctrl+D.
  10. Check that the processed dataset exists in the HDFS directory.
    hadoop fs -ls /tmp/ml/dataset/house

  11. Delete unnecessary dataset directories from HDFS.
    hadoop fs -rm -r /tmp/dataset/ml
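The per-line transformation in step 8 can be sketched in plain Python (the sample lines below are hypothetical, standing in for house.ds records): keep only rows with exactly 8 space-separated fields, then drop the leading field and rejoin the remaining seven.

```python
# Plain-Python sketch of the Spark pipeline:
# .map(x => x.split(" ")).filter(_.length == 8).map(x => x.slice(1, 8).mkString(" "))
def process_line(line):
    fields = line.split(" ")
    if len(fields) != 8:
        return None  # dropped, like .filter(_.length == 8)
    return " ".join(fields[1:8])  # like .slice(1, 8).mkString(" ")

lines = [
    "1 a b c d e f g",   # 8 fields -> kept, leading field dropped
    "2 a b c",           # 4 fields -> filtered out
]
processed = [r for r in (process_line(l) for l in lines) if r is not None]
print(processed)  # -> ['a b c d e f g']
```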

Downloading the HIGGS Dataset

  1. Create a /test/dataset/ml/higgs directory and go to the directory.
    mkdir -p /test/dataset/ml/higgs
    cd /test/dataset/ml/higgs
  2. Download the HIGGS dataset from the official website.
    wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/HIGGS.bz2
  3. Decompress the dataset to the current directory.
    bzip2 -d HIGGS.bz2
  4. Create a /tmp/ml/dataset/higgs folder in HDFS.
    hadoop fs -mkdir -p /tmp/ml/dataset/higgs
  5. Upload the dataset to HDFS.
    hadoop fs -put /test/dataset/ml/higgs/HIGGS /tmp/ml/dataset/higgs
  6. Start spark-shell.
    spark-shell
  7. Run the following command (do not omit the colon):
    :paste
  8. Run the following code to split the dataset into a training dataset and a test dataset:
    val reader = spark.read.format("libsvm").option("numFeatures", 28)
    val dataPath = "/tmp/ml/dataset/higgs"
    val data = reader.load(dataPath)
    val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3),2020)
    val trainOutputPath = s"${dataPath}_train"
    val testOutputPath = s"${dataPath}_test"
    trainingData.write.format("libsvm").save(trainOutputPath)
    testData.write.format("libsvm").save(testOutputPath)
  9. Press Enter and press Ctrl+D.
  10. Delete unnecessary dataset directories from HDFS.
    hadoop fs -rm -r /tmp/ml/dataset/higgs
  11. Check that the training dataset and test dataset exist in the HDFS directory.
    hadoop fs -ls /tmp/ml/dataset
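The 70/30 split in step 8 relies on Spark's randomSplit, which assigns each row independently at random, so the resulting sizes are only approximately 70% and 30%. A minimal pure-Python sketch of that behaviour on toy data (seed 2020 as above; the function name and weighting logic are illustrative, not Spark's internals):

```python
import random

def random_split(rows, weights, seed):
    # Assign each row independently to a partition, with probability
    # proportional to the partition's weight (like Spark's randomSplit).
    rng = random.Random(seed)
    total = sum(weights)
    cuts, acc = [], 0.0
    for w in weights:
        acc += w / total
        cuts.append(acc)
    parts = [[] for _ in weights]
    for row in rows:
        r = rng.random()
        for i, cut in enumerate(cuts):
            if r < cut or i == len(cuts) - 1:
                parts[i].append(row)
                break
    return parts

train, test = random_split(list(range(1000)), [0.7, 0.3], seed=2020)
print(len(train), len(test))  # roughly 700 / 300, not exactly
```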

Downloading the NYTimes Dataset

  1. Create a /test/dataset/ml/nytimes directory and go to the directory.
    mkdir -p /test/dataset/ml/nytimes
    cd /test/dataset/ml/nytimes
  2. Download the NYTimes dataset from the official website.
    wget https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/docword.nytimes.txt.gz
  3. Decompress the dataset to the current directory.
    gzip -d docword.nytimes.txt.gz
  4. Create a dataset_process.py file. (Make sure correct indentation is applied in the Python file.)
    vim dataset_process.py

    The file content is as follows:

    import sys

    if __name__ == "__main__":
        if len(sys.argv) <= 1:
            print("Please input dataset")
            exit()
        filename = sys.argv[1]
        print("Reading data")
        processed_data = {}
        with open(filename, 'r') as fp:
            data = fp.readlines()
        print("Pre-processing data")
        for line in data[3:]:
            line_split = line.strip().split()
            if len(line_split) < 3:
                continue
            doc_id = int(line_split[0])
            vocab_id = line_split[1]
            term_num = line_split[2]
            if doc_id not in processed_data:
                processed_data[doc_id] = str(doc_id)
            processed_data[doc_id] += (" %s:%s" % (vocab_id, term_num))
        print("Post-processing data")
        doc_ids = list(processed_data.keys())
        doc_ids.sort()
        data = []
        for doc_id in doc_ids:
            data.append(processed_data[doc_id] + "\n")
        print("Writing data")
        with open(filename + ".libsvm", 'w') as fp:
            fp.writelines(data)
  5. Use dataset_process.py to convert the dataset to the LibSVM format.
    python3 dataset_process.py docword.nytimes.txt
  6. Rename the dataset docword.nytimes.txt.libsvm as docword.nytimes.txt.libsvm.raw.
    mv docword.nytimes.txt.libsvm docword.nytimes.txt.libsvm.raw
  7. Create a reorder.py file. (Make sure correct indentation is applied in the Python file.)
    vim reorder.py

    The file content is as follows:

    filename = "docword.nytimes.txt.libsvm.raw"
    new_filename = "docword.nytimes.txt.libsvm"
    with open(filename, 'r') as fp:
        filedata = fp.readlines()
    print("Data length: %d" % len(filedata))
    count = 0
    data = []
    for line in filedata:
        line_split = line.strip().split()
        doc_index = int(line_split[0])
        doc_terms = {}
        for term in line_split[1:]:
            term_split = term.strip().split(":")
            assert int(term_split[0]) not in doc_terms
            doc_terms[int(term_split[0])] = int(term_split[1])
        data.append([doc_index, doc_terms])
        count += 1
        if count % 100000 == 0:
            print("Processed %d00K" % int(count / 100000))
    count = 0
    new_filedata = []
    for doc in data:
        doc_string = str(doc[0])
        term_indices = list(doc[1].keys())
        term_indices.sort()
        for term_index in term_indices:
            doc_string += (" " + str(term_index) + ":" + str(doc[1][term_index]))
        doc_string += "\n"
        new_filedata.append(doc_string)
        count += 1
        if count % 100000 == 0:
            print("Generated %d00K" % int(count / 100000))
    with open(new_filename, 'w') as fp:
        fp.writelines(new_filedata)
  8. Use reorder.py to reorder the dataset renamed in step 6.
    python3 reorder.py
  9. Create a /tmp/ml/dataset/nytimes folder in HDFS.
    hadoop fs -mkdir -p /tmp/ml/dataset/nytimes/
  10. Upload the dataset to HDFS.
    hadoop fs -put /test/dataset/ml/nytimes/docword.nytimes.txt.libsvm /tmp/ml/dataset/nytimes/
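Taken together, dataset_process.py and reorder.py turn the docword triples (doc_id, vocab_id, count) into one LibSVM-style line per document with term indices in ascending order. A toy end-to-end sketch of that combined transformation (the real scripts also skip the three header lines of a docword file; this sketch starts directly from the triples):

```python
def docword_to_libsvm(triples):
    # Group counts by document, then emit "doc_id idx:count idx:count ..."
    # with documents and term indices both in ascending order.
    docs = {}
    for doc_id, vocab_id, count in triples:
        docs.setdefault(doc_id, {})[vocab_id] = count
    lines = []
    for doc_id in sorted(docs):
        terms = " ".join("%d:%d" % (v, docs[doc_id][v]) for v in sorted(docs[doc_id]))
        lines.append("%d %s" % (doc_id, terms))
    return lines

sample = [(2, 7, 1), (1, 3, 2), (1, 1, 5)]
print(docword_to_libsvm(sample))
# -> ['1 1:5 3:2', '2 7:1']
```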

Downloading the Kosarak Dataset

  1. Create a /test/dataset/ml/Kosarak directory and go to the directory.
    mkdir -p /test/dataset/ml/Kosarak
    cd /test/dataset/ml/Kosarak
  2. Download the Kosarak dataset from the official website.
    wget http://www.philippe-fournier-viger.com/spmf/datasets/kosarak_sequences.txt
  3. Create a /tmp/ml/dataset/Kosarak folder in HDFS.
    hadoop fs -mkdir -p /tmp/ml/dataset/Kosarak/
  4. Upload the dataset to HDFS.
    hadoop fs -put /test/dataset/ml/Kosarak/kosarak_sequences.txt /tmp/ml/dataset/Kosarak/

Downloading the DEEP1B Dataset

  1. Create a /test/dataset/ml/DEEP1B directory and go to the directory.
    mkdir -p /test/dataset/ml/DEEP1B
    cd /test/dataset/ml/DEEP1B
  2. Download the DEEP1B dataset from the official website.
    wget http://ann-benchmarks.com/deep-image-96-angular.hdf5
  3. Create a processHDF5.py file. (Make sure correct indentation is applied in the Python file.)
    vim processHDF5.py

    The file content is as follows:

    import os
    import h5py

    # downloaded hdf5 file
    inputFile = h5py.File('deep-image-96-angular.hdf5', 'r')
    # directory name to store output files
    outputDir = "deep1b"
    # the number of samples in each output file
    samplesPerFile = 5000
    sampleCnt = 0
    fileCnt = 0
    writer = open(os.path.join(outputDir, 'part-{}'.format(fileCnt)), 'w')
    data = inputFile['train']
    for feature in data:
        writer.write(','.join([str(d) for d in feature]) + "\n")
        sampleCnt += 1
        if sampleCnt == samplesPerFile:
            writer.close()
            fileCnt += 1
            sampleCnt = 0
            writer = open(os.path.join(outputDir, 'part-{}'.format(fileCnt)), 'w')
    data = inputFile['test']
    for feature in data:
        writer.write(','.join([str(d) for d in feature]) + "\n")
        sampleCnt += 1
        if sampleCnt == samplesPerFile:
            writer.close()
            fileCnt += 1
            sampleCnt = 0
            writer = open(os.path.join(outputDir, 'part-{}'.format(fileCnt)), 'w')
    writer.close()
  4. Convert the HDF5 file into a text file. Each sample occupies a line, and features are separated by commas (,).
    mkdir deep1b
    python3 processHDF5.py
  5. If an error is displayed indicating that the h5py module cannot be found, install it and run the script again.
    python3 -m pip install h5py

  6. Create a /tmp/ml/dataset/DEEP1B folder in HDFS.
    hadoop fs -mkdir -p /tmp/ml/dataset/DEEP1B
  7. Upload the dataset to HDFS.
    hadoop fs -put /test/dataset/ml/DEEP1B/deep1b/* /tmp/ml/dataset/DEEP1B/
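The chunking logic in processHDF5.py writes comma-separated feature vectors into part-N files of at most samplesPerFile lines each. A self-contained sketch of that logic, collecting the would-be file contents in memory instead of writing files (the toy vectors stand in for the HDF5 samples):

```python
def chunk_samples(samples, samples_per_file):
    # Map each would-be "part-N" file name to its list of CSV lines,
    # starting a new part once samples_per_file lines have been written.
    files = {}
    sample_cnt, file_cnt = 0, 0
    for feature in samples:
        name = "part-%d" % file_cnt
        files.setdefault(name, []).append(",".join(str(d) for d in feature))
        sample_cnt += 1
        if sample_cnt == samples_per_file:
            file_cnt += 1
            sample_cnt = 0
    return files

out = chunk_samples([[1, 2], [3, 4], [5, 6]], samples_per_file=2)
print(sorted(out))  # -> ['part-0', 'part-1']
```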

Downloading the MNIST8M Dataset

  1. Create a /test/dataset/ml/mnist8m directory and go to the directory.
    mkdir -p /test/dataset/ml/mnist8m
    cd /test/dataset/ml/mnist8m
  2. Download the MNIST8M dataset from the official website.
    wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist8m.bz2
  3. Decompress the training dataset and test dataset to the current directory.
    bzip2 -d mnist8m.bz2
  4. Create a /tmp/ml/dataset/mnist8m folder in HDFS.
    hadoop fs -mkdir -p /tmp/ml/dataset/mnist8m
  5. Upload the dataset to HDFS.
    hadoop fs -put /test/dataset/ml/mnist8m/mnist8m /tmp/ml/dataset/mnist8m
  6. Start spark-shell.
    spark-shell
  7. Run the following command (do not omit the colon):
    :paste
  8. Run the following code to split the dataset into a training dataset and a test dataset:
    val reader = spark.read.format("libsvm").option("numFeatures", 784)
    val dataPath = "/tmp/ml/dataset/mnist8m"
    val data = reader.load(dataPath)
    val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3),2020)
    val trainOutputPath = s"${dataPath}_train"
    val testOutputPath = s"${dataPath}_test"
    trainingData.write.format("libsvm").save(trainOutputPath)
    testData.write.format("libsvm").save(testOutputPath)
  9. Press Enter and press Ctrl+D.
  10. Delete unnecessary dataset directories from HDFS.
    hadoop fs -rm -r /tmp/ml/dataset/mnist8m
  11. Check that the training dataset and test dataset exist in the HDFS directory.
    hadoop fs -ls /tmp/ml/dataset

Downloading the Epsilon Dataset

  1. Create a /test/dataset/ml/epsilon directory and go to the directory.
    mkdir -p /test/dataset/ml/epsilon
    cd /test/dataset/ml/epsilon
  2. Download the Epsilon training dataset and test dataset from the official website.
    wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/epsilon_normalized.bz2
    wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/epsilon_normalized.t.bz2
  3. Decompress the training dataset and test dataset to the current directory.
    bzip2 -d epsilon_normalized.bz2
    bzip2 -d epsilon_normalized.t.bz2

  4. Create /tmp/ml/dataset/epsilon_train and /tmp/ml/dataset/epsilon_test folders in HDFS.
    hadoop fs -mkdir -p /tmp/ml/dataset/epsilon_train
    hadoop fs -mkdir -p /tmp/ml/dataset/epsilon_test
  5. Upload the training dataset and test dataset to HDFS.
    hadoop fs -put /test/dataset/ml/epsilon/epsilon_normalized /tmp/ml/dataset/epsilon_train
    hadoop fs -put /test/dataset/ml/epsilon/epsilon_normalized.t /tmp/ml/dataset/epsilon_test

Downloading the MESH_DEFORM Dataset

  1. Create a /test/dataset/ml/mesh_deform directory and go to the directory.
    mkdir -p /test/dataset/ml/mesh_deform
    cd /test/dataset/ml/mesh_deform
  2. Download the MESH_DEFORM dataset from the official website.
    wget https://suitesparse-collection-website.herokuapp.com/MM/Yoshiyasu/mesh_deform.tar.gz
  3. Decompress the dataset to the current directory.
    tar zxvf mesh_deform.tar.gz
  4. Open the extracted mesh_deform.mtx file and delete lines 1 to 25. Lines 1 to 24 are information lines. Line 25 indicates the number of rows, columns, and non-zero elements in the matrix. The data starts from line 26.
    vim mesh_deform.mtx
  5. Create a /tmp/ml/dataset/MESH_DEFORM folder in HDFS.
    hadoop fs -mkdir -p /tmp/ml/dataset/MESH_DEFORM
  6. Upload the dataset to HDFS.
    hadoop fs -put mesh_deform.mtx /tmp/ml/dataset/MESH_DEFORM/
  7. Check that the dataset exists in the HDFS directory.
    hadoop fs -ls /tmp/ml/dataset/MESH_DEFORM
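The manual edit in step 4 removes the MatrixMarket header: the comment lines (which start with %) and the single size line that follows them, leaving only the data lines. A sketch of the same clean-up as a script, in case editing a large .mtx file in vim is impractical (the sample lines below are illustrative, not taken from mesh_deform.mtx):

```python
def strip_mtx_header(lines):
    # Skip leading '%' comment lines; the first non-comment line is the
    # size line (rows, columns, non-zeros) and is dropped as well.
    data_start = 0
    for i, line in enumerate(lines):
        if not line.startswith("%"):
            data_start = i + 1
            break
    return lines[data_start:]

sample = [
    "%%MatrixMarket matrix coordinate real symmetric",
    "% comment",
    "9393 9393 29313",
    "1 1 0.5",
    "2 1 0.25",
]
print(strip_mtx_header(sample))
# -> ['1 1 0.5', '2 1 0.25']
```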