Preprocessing Data
DLRM training uses Criteo datasets. The full official dataset collection is about 1 TB, which is too large for a demonstration. This document therefore uses only the smaller click-through-rate dataset as an example to describe how to preprocess data.
- Download the dataset to the local host. Before downloading the dataset, you need to log in to Kaggle.
- Use PuTTY to upload the dataset file archive.zip to the planned dataset path /path/to/dataset.
- Use PuTTY to log in to the server as the root user and go to the /path/to/dataset path.
cd /path/to/dataset
- Decompress archive.zip and go to the dac directory.
unzip archive.zip -d ./
cd dac
- Create a data_process.py file in the dac directory to preprocess the dataset and generate the target NPZ file.
- Create a data_process.py file.
vi data_process.py
- Press i to enter the insert mode and edit the data_process.py file. For details, see data_process.py File.
- Press Esc, type :wq!, and press Enter to save the file and exit.
- Execute the data_process.py file to preprocess data.
python data_process.py

If the preceding information is displayed, data preprocessing is successful. Preprocessing takes approximately 5 minutes; the actual elapsed time may be shorter or longer depending on your environment.
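The actual data_process.py is documented in the data_process.py File section. Purely as an illustration of the kind of work such a script does, the sketch below converts raw Criteo-style lines into an NPZ file. The function name, the hash size, the NPZ keys (y, X_int, X_cat), and the assumed raw format (one tab-separated line per sample: a label, 13 integer features, 26 hex categorical features) are assumptions for this sketch, not the contents of the real script:

```python
import numpy as np

NUM_DENSE, NUM_SPARSE = 13, 26  # assumed Criteo Kaggle feature counts

def preprocess(input_path, output_path, hash_size=10_000_000):
    """Convert raw lines (label, 13 ints, 26 hex strings) into an NPZ file."""
    labels, dense, sparse = [], [], []
    with open(input_path) as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            labels.append(int(fields[0]))
            # Missing dense values become 0.
            dense.append([int(v) if v else 0 for v in fields[1:1 + NUM_DENSE]])
            # Hash hex categorical strings into a fixed-size index space.
            sparse.append([int(v, 16) % hash_size if v else 0
                           for v in fields[1 + NUM_DENSE:]])
    np.savez(output_path,
             y=np.array(labels, dtype=np.int32),
             X_int=np.array(dense, dtype=np.int32),
             X_cat=np.array(sparse, dtype=np.int64))

# Tiny self-contained demo: two synthetic rows in the assumed layout.
row = "1\t" + "\t".join(["5"] * NUM_DENSE) + "\t" + "\t".join(["68fd1e64"] * NUM_SPARSE)
with open("demo_train.txt", "w") as f:
    f.write(row + "\n" + row.replace("1\t", "0\t", 1) + "\n")

preprocess("demo_train.txt", "demo_processed.npz")
data = np.load("demo_processed.npz")
print(data["y"].shape, data["X_int"].shape, data["X_cat"].shape)  # (2,) (2, 13) (2, 26)
```

The real script additionally shuffles and splits the data; consult the data_process.py File section for the authoritative implementation.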
- Copy the kaggle_processed.npz file generated after data preprocessing to the /path/to/dataset/criteo directory and verify the copying operation.
mkdir ../criteo
cp ./kaggle_data/kaggle_processed.npz ../criteo/
ll ../criteo/

If kaggle_processed.npz appears in the command output, the file has been successfully copied to /path/to/dataset/criteo.
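Beyond listing the directory, the copied NPZ file can be sanity-checked by loading it with NumPy. A minimal self-contained sketch follows; it uses a stand-in file for demonstration, since the real /path/to/dataset/criteo/kaggle_processed.npz path depends on your environment:

```python
import numpy as np

def summarize(path):
    """Return the array names and shapes stored in an NPZ file."""
    with np.load(path) as data:
        return {name: data[name].shape for name in data.files}

# Stand-in file for demonstration; in a real run, point summarize()
# at /path/to/dataset/criteo/kaggle_processed.npz instead.
np.savez("example.npz", y=np.zeros(4, dtype=np.int32))
print(summarize("example.npz"))  # {'y': (4,)}
```

If summarize() raises an error or reports unexpected shapes, repeat the preprocessing step before starting training.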
Parent topic: Training the DLRM