Performing Tests
This section describes how to run performance tests in the inference phase. An existing model file can be used for the inference stress test.
Performance Test in the Inference Phase
- Go to the directory where ModelZoo is stored.
cd /path/to/sra_benchmark/modelzoo
- Run the inference performance stress tests on Wide_and_Deep, DLRM, DeepFM, DFFM, and DSSM with the following command:
python inference_throughput_test.py --test_method entire --meta_path /path/to/sra_benchmark --serving_path /path/to/tfserving --image nvcr.io/nvidia/tritonserver:24.05-py3-sdk --intra 0 --inter 0
Table 1 describes the parameters in the stress test command. The throughput of each model during inference is saved in the inference_log folder under modelzoo.
The following figure shows the files that store the stress test results of some models.

For example, if the server has four NUMA nodes, the DeepFM test results are recorded per NUMA node in deepfm_client0.txt, deepfm_client1.txt, deepfm_client2.txt, and deepfm_client3.txt in the inference_log folder. If the test is performed on a single NUMA node, the result is recorded in deepfm_client0.txt in the inference_log folder (in this case, only the NUMA0 node is used).
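Because each NUMA node writes its own client log, the server-level throughput is the sum across the per-node files. The sketch below aggregates them; the log-line pattern in the regex is an assumption based on the fields described in this section (Concurrency, throughput, latency) and may need adjusting to the actual file layout.

```python
import glob
import os
import re


def total_throughput(log_dir="inference_log", model="deepfm"):
    """Sum the best throughput reported in each per-NUMA-node client log.

    Assumes (hypothetically) that each log contains lines such as:
        Concurrency: 4, throughput: 1234.5 infer/sec, latency: 980 usec
    Adjust the regex if the real log format differs.
    """
    total = 0.0
    pattern = os.path.join(log_dir, f"{model}_client*.txt")
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            rates = [float(m.group(1))
                     for m in re.finditer(r"throughput:\s*([\d.]+)", f.read())]
        if rates:
            total += max(rates)  # best rate this NUMA node achieved
    return total
```

For a four-node run, calling `total_throughput(model="deepfm")` from the modelzoo directory would combine deepfm_client0.txt through deepfm_client3.txt.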
Table 1 Parameters of the stress test command in the inference phase
- --test_method: NUMA resources used during the inference phase.
  - single: Only the NUMA0 node is used.
  - entire: All NUMA nodes of the server are used (default).
- --meta_path: Path of the sra_benchmark directory.
- --serving_path: Path that contains the TensorFlow Serving executable binary file.
- --image: Name and version of the Triton Server container image used for the stress test.
- --intra: tensorflow_intra_op_parallelism, the number of parallel threads within an individual operation. The default value is 0.
- --inter: tensorflow_inter_op_parallelism, the number of parallel threads between independent operations. The default value is 0.
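For both threading parameters, 0 means the runtime chooses a value itself, typically based on the number of available cores. A minimal illustrative sketch of that resolution logic (the function name is hypothetical, and TensorFlow's internal heuristic may differ):

```python
import os


def resolve_parallelism(intra: int = 0, inter: int = 0):
    """Resolve --intra/--inter settings: a positive value is used as-is,
    while 0 means "let the runtime decide", approximated here by the
    CPU count. Illustrative only, not TensorFlow's exact heuristic.
    """
    default = os.cpu_count() or 1
    return (intra if intra > 0 else default,
            inter if inter > 0 else default)
```

For example, `resolve_parallelism(4, 2)` keeps the explicit values, while `resolve_parallelism(0, 0)` falls back to the machine's core count for both.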
Test Results
The following uses the inference performance stress test result of DeepFM as an example. Figure 1 shows part of the content in deepfm_client0.txt.

- Concurrency: Number of concurrent requests sent by the client during the performance test.
- throughput: Number of inferences per second, in infer/sec.
- latency: P99 latency, meaning that 99% of requests complete with a latency lower than this value, in usec.
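The P99 latency reported in the log can be reproduced from raw per-request latencies. The sketch below uses the nearest-rank percentile method; the actual benchmark client may interpolate or bucket samples differently.

```python
import math


def p99_latency(latencies_usec):
    """Return the P99 latency in usec: 99% of requests are at or below
    this value. Uses the nearest-rank method, which may differ slightly
    from the interpolation used by a given benchmark client.
    """
    if not latencies_usec:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_usec)
    rank = math.ceil(0.99 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]
```

For instance, given 100 samples of 1 to 100 usec, the nearest-rank P99 is the 99th smallest sample.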