
Performing Tests

This section describes how to run performance stress tests in the inference phase. An existing model file can be used for the stress test.

Performance Test in the Inference Phase

  1. Go to the directory where ModelZoo is stored.
    cd /path/to/sra_benchmark/modelzoo
    
  2. Run the following command to perform the inference-phase stress tests on Wide_and_Deep, DLRM, DeepFM, DFFM, and DSSM:
    python inference_throughput_test.py --test_method entire --meta_path /path/to/sra_benchmark --serving_path /path/to/tfserving --image nvcr.io/nvidia/tritonserver:24.05-py3-sdk --intra 0 --inter 0
    

    Table 1 describes the parameters in the stress test command. The throughput of each model during inference is saved in the inference_log folder under modelzoo.

    The following figure shows the files that store the stress test results of some models.

    For example, if the server has four NUMA nodes, the DeepFM test results are recorded per NUMA node in deepfm_client0.txt, deepfm_client1.txt, deepfm_client2.txt, and deepfm_client3.txt in the inference_log folder. If the test is performed on a single NUMA node, the result is recorded only in deepfm_client0.txt in the inference_log folder (in this case, only the NUMA0 node is used).
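    The per-node result files described above can be aggregated into a single server-wide throughput figure. The sketch below assumes the files follow the deepfm_client<N>.txt naming shown here; the line format inside each file (e.g. "Throughput: 1234.5 infer/sec") is an assumption, not confirmed by this document.

    ```python
    import re
    from pathlib import Path

    def total_throughput(log_dir, model="deepfm"):
        """Sum the throughput reported across <model>_client<N>.txt files.

        Assumes each file contains a line such as
        'Throughput: 1234.5 infer/sec' -- the exact line format is an
        assumption about the result files, not documented here.
        """
        total = 0.0
        for path in sorted(Path(log_dir).glob(f"{model}_client*.txt")):
            for line in path.read_text().splitlines():
                m = re.search(r"[Tt]hroughput:\s*([\d.]+)", line)
                if m:
                    total += float(m.group(1))
        return total
    ```

    Summing per-NUMA-node figures is appropriate here because each client file records the throughput of an independent client bound to one NUMA node.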

    Table 1 Parameters of the stress test command in the inference phase

    Parameter

    Description

    --test_method

    Indicates the NUMA resources used during the inference phase.

    • single: Only the NUMA node NUMA0 is used.
    • entire: All NUMA nodes of the server are used (default).

    --meta_path

    Path of the sra_benchmark directory.

    --serving_path

    Path that contains the TensorFlow Serving executable binary file.

    --image

    Name and tag of the Triton Server container image used for the stress test.

    --intra

    tensorflow_intra_op_parallelism, the number of parallel threads within an individual operation. The default value is 0, which lets TensorFlow choose the thread count automatically.

    --inter

    tensorflow_inter_op_parallelism, the number of parallel threads across independent operations. The default value is 0, which lets TensorFlow choose the thread count automatically.
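    The defaults in Table 1 can be summarized with a small argument-parsing sketch. This is a hypothetical illustration of the flags and their defaults; the real inference_throughput_test.py may parse its arguments differently.

    ```python
    import argparse

    def build_parser():
        """Hypothetical sketch of the Table 1 flags and defaults;
        not the actual parser used by inference_throughput_test.py."""
        p = argparse.ArgumentParser()
        p.add_argument("--test_method", choices=["single", "entire"], default="entire",
                       help="NUMA resources to use: single (NUMA0 only) or entire (all nodes)")
        p.add_argument("--meta_path", required=True,
                       help="path of the sra_benchmark directory")
        p.add_argument("--serving_path", required=True,
                       help="path containing the TensorFlow Serving binary")
        p.add_argument("--image", required=True,
                       help="Triton Server container image name and tag")
        p.add_argument("--intra", type=int, default=0,
                       help="tensorflow_intra_op_parallelism (0 = let TensorFlow decide)")
        p.add_argument("--inter", type=int, default=0,
                       help="tensorflow_inter_op_parallelism (0 = let TensorFlow decide)")
        return p
    ```

    Omitting --test_method, --intra, and --inter therefore runs the test on all NUMA nodes with TensorFlow-chosen thread counts.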

Test Results

The following uses the inference performance stress test result of DeepFM as an example. Figure 1 shows part of the content in deepfm_client0.txt.

Figure 1 Example of DeepFM inference performance stress test result (stored in the deepfm_client0.txt file)
  • Concurrency: Number of concurrent requests sent by the client during the performance test.
  • Throughput: Number of inferences performed per second. The unit is infer/sec.
  • Latency: P99 latency, meaning that 99% of requests complete in less than this value. The unit is usec (microseconds).
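The P99 latency reported above can be illustrated with a short computation over per-request latencies. The nearest-rank percentile method used here is an assumption; the benchmark may compute the percentile differently.

```python
import math

def p99_latency(latencies_usec):
    """Return the P99 latency (usec): 99% of requests finish at or
    below this value. Uses the nearest-rank method, which is an
    assumption about how the benchmark computes the percentile."""
    ordered = sorted(latencies_usec)
    rank = math.ceil(0.99 * len(ordered))  # 1-based nearest-rank index
    return ordered[rank - 1]
```

For example, over 100 requests with latencies 1..100 usec, the P99 latency is 99 usec: only the single slowest request exceeds it.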