Preprocessing Data

DLRM training uses Criteo datasets. The full official dataset collection is 1 TB, which is too large for a demonstration. This document therefore uses only the smaller Kaggle click-through-rate dataset as an example to describe how to preprocess data.

  1. Log in to Kaggle, then download the dataset to the local host.
  2. Use PuTTY to upload the dataset file archive.zip to the planned dataset path /path/to/dataset.
  3. Use PuTTY to log in to the server as the root user and go to the /path/to/dataset path.
    cd /path/to/dataset
  4. Decompress archive.zip and go to the dac directory.
    unzip archive.zip -d ./ 
    cd dac
  5. Create a data_process.py file in the dac directory to preprocess the dataset and generate the target NPZ file.
    1. Create a data_process.py file.
      vi data_process.py
    2. Press i to enter the insert mode and edit the data_process.py file. For details, see data_process.py File.
    3. Press Esc, type :wq!, and press Enter to save the file and exit.
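The contents of data_process.py are described in data_process.py File. As a rough sketch only (not the official script), the preprocessing reads each tab-separated line of the raw data file — a click label, 13 integer features, and 26 categorical features encoded as hex strings — and saves them as arrays in an NPZ file. The array names (y, X_int, X_cat), the missing-value handling, and the input file name are assumptions here:

```python
# Hedged sketch of Criteo Kaggle preprocessing (not the official data_process.py).
# Each line of train.txt holds, tab-separated: a click label (0/1),
# 13 integer features, and 26 categorical features as hex strings.
# Empty fields are treated as 0 here; the real script may impute differently.
import numpy as np

NUM_INT = 13   # dense integer features per row
NUM_CAT = 26   # hashed categorical features per row

def parse_line(line):
    """Split one raw Criteo row into (label, int_features, cat_features)."""
    fields = line.rstrip("\n").split("\t")
    label = int(fields[0])
    ints = [int(f) if f else 0 for f in fields[1:1 + NUM_INT]]
    cats = [int(f, 16) if f else 0
            for f in fields[1 + NUM_INT:1 + NUM_INT + NUM_CAT]]
    return label, ints, cats

def process(lines, out_path="kaggle_processed.npz"):
    """Parse all rows and save them as arrays in a compressed NPZ file."""
    y, X_int, X_cat = [], [], []
    for line in lines:
        label, ints, cats = parse_line(line)
        y.append(label)
        X_int.append(ints)
        X_cat.append(cats)
    np.savez_compressed(out_path,
                        y=np.array(y, dtype=np.int32),
                        X_int=np.array(X_int, dtype=np.int32),
                        X_cat=np.array(X_cat, dtype=np.uint32))
```

The categorical features use uint32 because the 8-digit hex hashes can exceed the signed 32-bit range.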
  6. Execute the data_process.py file to preprocess data.
    python data_process.py

    If the script runs to completion without errors, data preprocessing is successful. Preprocessing takes approximately 5 minutes; a shorter or longer run time is normal.

  7. Copy the kaggle_processed.npz file generated after data preprocessing to the /path/to/dataset/criteo directory and verify the copying operation.
    mkdir ../criteo  
    cp ./kaggle_data/kaggle_processed.npz  ../criteo/ 
    ll ../criteo/

    If kaggle_processed.npz appears in the ll output, the file has been successfully copied to /path/to/dataset/criteo.
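As an additional check beyond ll, you can load the NPZ file and list the arrays it contains. This is a hedged sketch: the array names you should expect (for example y, X_int, X_cat) depend on what data_process.py actually saves.

```python
# Hedged check: load the copied NPZ file and report its arrays.
# The expected key names depend on what data_process.py writes.
import numpy as np

def inspect_npz(path):
    """Return {array_name: shape} for every array stored in an NPZ file."""
    with np.load(path) as data:
        return {name: data[name].shape for name in data.files}

# Example (path from the steps above):
# print(inspect_npz("/path/to/dataset/criteo/kaggle_processed.npz"))
```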