Preprocessing Data
DLRM training uses Criteo datasets. The full official dataset collection is about 1 TB, which is too large for a demonstration. This document therefore uses only the smaller click-through-rate dataset as an example to describe how to preprocess data.
- Download the dataset to the local host. Before downloading the dataset, you need to log in to Kaggle.
- Use PuTTY to upload the dataset file archive.zip to the planned dataset path /path/to/dataset.
- Use PuTTY to log in to the server as the root user and go to the /path/to/dataset path.
cd /path/to/dataset
- Decompress archive.zip and go to the dac directory.
unzip archive.zip -d ./
cd dac
- Create a data_process.py file in the dac directory to preprocess the dataset and generate the target NPZ file.
- Create a data_process.py file.
vi data_process.py
- Press i to enter the insert mode and edit the data_process.py file. For details, see data_process.py File.
- Press Esc, type :wq!, and press Enter to save the file and exit.
- Execute the data_process.py file to preprocess data.
python data_process.py

If the preceding information is displayed, data preprocessing is successful. Preprocessing takes approximately 5 minutes; the actual elapsed time may be shorter or longer depending on your environment.
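The actual data_process.py is documented in the data_process.py File section. Purely as an illustration of the kind of work such a script does, the sketch below converts raw Criteo-style lines into an NPZ file. The function name, the hash size, the NPZ keys (y, X_int, X_cat), and the assumed raw format (one tab-separated line per sample: a label, 13 integer features, 26 hex categorical features) are assumptions for this sketch, not the contents of the real script:

```python
import numpy as np

NUM_DENSE, NUM_SPARSE = 13, 26  # assumed Criteo Kaggle feature counts

def preprocess(input_path, output_path, hash_size=10_000_000):
    """Convert raw lines (label, 13 ints, 26 hex strings) into an NPZ file."""
    labels, dense, sparse = [], [], []
    with open(input_path) as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            labels.append(int(fields[0]))
            # Missing dense values become 0.
            dense.append([int(v) if v else 0 for v in fields[1:1 + NUM_DENSE]])
            # Hash hex categorical strings into a fixed-size index space.
            sparse.append([int(v, 16) % hash_size if v else 0
                           for v in fields[1 + NUM_DENSE:]])
    np.savez(output_path,
             y=np.array(labels, dtype=np.int32),
             X_int=np.array(dense, dtype=np.int32),
             X_cat=np.array(sparse, dtype=np.int64))

# Tiny self-contained demo: two synthetic rows in the assumed layout.
row = "1\t" + "\t".join(["5"] * NUM_DENSE) + "\t" + "\t".join(["68fd1e64"] * NUM_SPARSE)
with open("demo_train.txt", "w") as f:
    f.write(row + "\n" + row.replace("1\t", "0\t", 1) + "\n")

preprocess("demo_train.txt", "demo_processed.npz")
data = np.load("demo_processed.npz")
print(data["y"].shape, data["X_int"].shape, data["X_cat"].shape)  # (2,) (2, 13) (2, 26)
```

The real script additionally shuffles and splits the data; consult the data_process.py File section for the authoritative implementation.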
- Copy the kaggle_processed.npz file generated after data preprocessing to the /path/to/dataset/criteo directory and verify the copying operation.
mkdir ../criteo
cp ./kaggle_data/kaggle_processed.npz ../criteo/
ll ../criteo/

If kaggle_processed.npz appears in the command output, the file has been successfully copied to /path/to/dataset/criteo.
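Beyond listing the directory, the copied NPZ file can be sanity-checked by loading it with NumPy. A minimal self-contained sketch follows; it uses a stand-in file for demonstration, since the real /path/to/dataset/criteo/kaggle_processed.npz path depends on your environment:

```python
import numpy as np

def summarize(path):
    """Return the array names and shapes stored in an NPZ file."""
    with np.load(path) as data:
        return {name: data[name].shape for name in data.files}

# Stand-in file for demonstration; in a real run, point summarize()
# at /path/to/dataset/criteo/kaggle_processed.npz instead.
np.savez("example.npz", y=np.zeros(4, dtype=np.int32))
print(summarize("example.npz"))  # {'y': (4,)}
```

If summarize() raises an error or reports unexpected shapes, repeat the preprocessing step before starting training.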
Parent topic: Training the DLRM