
Performing Tests

This section describes how to run performance stress tests in the inference phase. An existing model file can be used for the stress test.

Performance Test in the Inference Phase

  1. Go to the directory where ModelZoo is stored.
    cd /path/to/sra_benchmark/modelzoo
    
  2. Run the following command to perform the inference-phase stress tests on Wide_and_Deep, DLRM, DeepFM, DFFM, and DSSM:
    python inference_throughput_test.py --test_method entire --meta_path /path/to/sra_benchmark --serving_path /path/to/tfserving --image nvcr.io/nvidia/tritonserver:24.05-py3-sdk --intra 0 --inter 0
    

    Table 1 describes the parameters in the stress test command. The throughput of each model during inference is saved in the inference_log folder under modelzoo.

    The following figure shows the files that store the stress test results of some models.

    For example, if the server has four NUMA nodes, the DeepFM test results are recorded per NUMA node in deepfm_client0.txt, deepfm_client1.txt, deepfm_client2.txt, and deepfm_client3.txt in the inference_log folder. If the test is performed on a single NUMA node, the result is recorded only in deepfm_client0.txt in the inference_log folder (in this case, only the NUMA0 node is used).
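    The per-node result files described above can be aggregated into a single server-wide throughput figure. The sketch below assumes the files follow the deepfm_client<N>.txt naming shown here; the line format inside each file (e.g. "Throughput: 1234.5 infer/sec") is an assumption, not confirmed by this document.

    ```python
    import re
    from pathlib import Path

    def total_throughput(log_dir, model="deepfm"):
        """Sum the throughput reported across <model>_client<N>.txt files.

        Assumes each file contains a line such as
        'Throughput: 1234.5 infer/sec' -- the exact line format is an
        assumption about the result files, not documented here.
        """
        total = 0.0
        for path in sorted(Path(log_dir).glob(f"{model}_client*.txt")):
            for line in path.read_text().splitlines():
                m = re.search(r"[Tt]hroughput:\s*([\d.]+)", line)
                if m:
                    total += float(m.group(1))
        return total
    ```

    Summing per-NUMA-node figures is appropriate here because each client file records the throughput of an independent client bound to one NUMA node.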

    Table 1 Parameters of the stress test command in the inference phase

    Parameter

    Description

    --test_method

    Indicates the NUMA resources used during the inference phase.

    • single: Only the NUMA node NUMA0 is used.
    • entire: All NUMA nodes of the server are used (default).

    --meta_path

    Path of the sra_benchmark directory.

    --serving_path

    Path that contains the TensorFlow Serving executable binary file.

    --image

    Name and tag of the Triton Server container image used for the stress test.

    --intra

    tensorflow_intra_op_parallelism, the number of parallel threads within an individual operation. The default value is 0, which lets TensorFlow choose the thread count automatically.

    --inter

    tensorflow_inter_op_parallelism, the number of parallel threads across independent operations. The default value is 0, which lets TensorFlow choose the thread count automatically.
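    The defaults in Table 1 can be summarized with a small argument-parsing sketch. This is a hypothetical illustration of the flags and their defaults; the real inference_throughput_test.py may parse its arguments differently.

    ```python
    import argparse

    def build_parser():
        """Hypothetical sketch of the Table 1 flags and defaults;
        not the actual parser used by inference_throughput_test.py."""
        p = argparse.ArgumentParser()
        p.add_argument("--test_method", choices=["single", "entire"], default="entire",
                       help="NUMA resources to use: single (NUMA0 only) or entire (all nodes)")
        p.add_argument("--meta_path", required=True,
                       help="path of the sra_benchmark directory")
        p.add_argument("--serving_path", required=True,
                       help="path containing the TensorFlow Serving binary")
        p.add_argument("--image", required=True,
                       help="Triton Server container image name and tag")
        p.add_argument("--intra", type=int, default=0,
                       help="tensorflow_intra_op_parallelism (0 = let TensorFlow decide)")
        p.add_argument("--inter", type=int, default=0,
                       help="tensorflow_inter_op_parallelism (0 = let TensorFlow decide)")
        return p
    ```

    Omitting --test_method, --intra, and --inter therefore runs the test on all NUMA nodes with TensorFlow-chosen thread counts.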

Test Results

The following uses the inference performance stress test result of DeepFM as an example. Figure 1 shows part of the content in deepfm_client0.txt.

Figure 1 Example of DeepFM inference performance stress test result (stored in the deepfm_client0.txt file)
  • Concurrency: Number of concurrent requests sent by the client during the performance test.
  • Throughput: Number of inferences performed per second. The unit is infer/sec.
  • Latency: P99 latency, meaning that 99% of requests complete in less than this value. The unit is usec (microseconds).
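The P99 latency reported above can be illustrated with a short computation over per-request latencies. The nearest-rank percentile method used here is an assumption; the benchmark may compute the percentile differently.

```python
import math

def p99_latency(latencies_usec):
    """Return the P99 latency (usec): 99% of requests finish at or
    below this value. Uses the nearest-rank method, which is an
    assumption about how the benchmark computes the percentile."""
    ordered = sorted(latencies_usec)
    rank = math.ceil(0.99 * len(ordered))  # 1-based nearest-rank index
    return ordered[rank - 1]
```

For example, over 100 requests with latencies 1..100 usec, the P99 latency is 99 usec: only the single slowest request exceeds it.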