Training the DLRM

This document does not cover model tuning; it demonstrates the overall process of training the DLRM on the TensorFlow framework. Train with the network parameters provided in the source code: the parameter count is relatively small, which shortens the training time. To further improve the training results, you are advised to tune the network parameters to raise the Area Under the Curve (AUC) value.

  1. Run the dlrm_criteo_gpu.py script to train the DLRM. The script automatically checks whether a GPU is available in the current environment. If no GPU is found, the training runs on the CPU.
    python dlrm_criteo_gpu.py

    If the actual training log is essentially the same as the preceding command output, the training is running normally. The AUC value may differ from the one shown, and the training results may vary slightly.

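    The AUC is the area under the receiver operating characteristic (ROC) curve. It measures how well the model separates the two classes, and therefore its generalization performance as a classifier. A higher AUC value indicates better classification. The maximum value is 1, and a value of 0.5 corresponds to random guessing.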

  2. View the generated model file.
    ls -la

    If the checkpoint, mymodel.data-00000-of-00001, and mymodel.index files have been generated, the model was trained successfully.
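
The two checks described in the steps above (that the model files exist, and what an AUC value means) can be sketched in a few lines of Python. This is only an illustrative sketch, not part of dlrm_criteo_gpu.py: the helper names missing_model_files and auc are hypothetical, the toy labels and scores are made up, and the pairwise AUC calculation is a simple textbook formulation that may differ from how the training script computes its metric. It is practical only for small samples because it compares every positive with every negative.

```python
from pathlib import Path

# File names taken from step 2 of this document.
EXPECTED_FILES = ("checkpoint", "mymodel.data-00000-of-00001", "mymodel.index")


def missing_model_files(directory="."):
    """Return the expected model files that are absent from `directory`.

    Hypothetical helper: an empty list means all three files were found.
    """
    base = Path(directory)
    return [name for name in EXPECTED_FILES if not (base / name).is_file()]


def auc(labels, scores):
    """Compute the ROC AUC of binary labels (0 or 1) and predicted scores.

    Hypothetical helper: the AUC equals the fraction of
    (positive, negative) pairs in which the positive sample receives the
    higher score, with ties counted as half a pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Check the current directory, as `ls -la` does in step 2.
    missing = missing_model_files(".")
    print("Missing model files:", missing if missing else "none")

    # Toy data (made up for illustration): 3 of the 4 pairs are ranked correctly.
    print("Toy AUC:", auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

Run the sketch from the directory that contains the training output. An empty list of missing files corresponds to the success condition in step 2, and the toy AUC shows how a value between 0.5 and 1 arises when most, but not all, positive samples are scored above the negative ones.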