Performance Test Method

Use the vllm/benchmarks/benchmark_serving.py script to measure the throughput and latency of a running vLLM service.

Performance Test Example

Start the vLLM service from any directory. Add any required optimization options to the command before starting the service.
```
vllm serve /home/models/DeepSeek-R1-Distill-Llama-70B/ --tensor_parallel_size=8
```
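Before running the benchmark, it can help to confirm that the service has finished loading the model. The following is a minimal sketch that polls the service until it responds. It assumes the service listens on the default address `http://localhost:8000` and exposes the `/health` endpoint of the OpenAI-compatible server. `http_probe` and `wait_until_ready` are hypothetical helper names, not part of vLLM.

```python
import time
import urllib.error
import urllib.request


def http_probe(url: str) -> bool:
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: not ready yet
        return False


def wait_until_ready(url, timeout_s=600.0, interval_s=5.0,
                     probe=http_probe, sleep=time.sleep):
    """Poll `url` until the probe succeeds or `timeout_s` seconds elapse."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Example call (assumed default address; adjust host and port as needed):
# wait_until_ready("http://localhost:8000/health")
```

The `probe` and `sleep` parameters exist only so that the polling loop can be exercised without a running service.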
Go to the vllm/benchmarks/ directory where the benchmark_serving.py test file is stored.

Install the required packages and perform the test. Table 1 describes the command options.

```
pip install datasets pandas
python benchmark_serving.py --model /home/models/DeepSeek-R1-Distill-Llama-70B/ --dataset_name random --random-input-len 2048 --random-output-len 2048 --trust-remote-code --ignore-eos --num-prompts 1 --request_rate 1
```

**Table 1** Performance test command options
Option	Description
--model /home/models/DeepSeek-R1-Distill-Llama-70B/	Specifies the model path. Use the same path that was passed to vllm serve.
--dataset_name random	Uses randomly generated prompts as the test dataset.
--random-input-len 2048	Sets the input length of each request to 2,048 tokens.
--random-output-len 2048	Sets the output length of each request to 2,048 tokens.
--trust-remote-code	Allows custom code shipped with the model to be loaded and run.
--ignore-eos	Ignores the end-of-sequence token so that generation continues until the specified output length is reached.
--num-prompts 1	Sends one prompt during the test.
--request_rate 1	Sets the request rate to one request per second.
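To see how the results change under load, the same command can be repeated at several request rates. The following sketch only builds and prints the command lines, reusing the options from Table 1. `build_benchmark_cmd` is a hypothetical helper, and the rates and prompt counts are illustrative values, not recommendations.

```python
import shlex

# Model path taken from the example command above
MODEL_PATH = "/home/models/DeepSeek-R1-Distill-Llama-70B/"


def build_benchmark_cmd(request_rate, num_prompts,
                        input_len=2048, output_len=2048):
    """Return the benchmark command as an argument list."""
    return [
        "python", "benchmark_serving.py",
        "--model", MODEL_PATH,
        "--dataset_name", "random",
        "--random-input-len", str(input_len),
        "--random-output-len", str(output_len),
        "--trust-remote-code",
        "--ignore-eos",
        "--num-prompts", str(num_prompts),
        "--request_rate", str(request_rate),
    ]


for rate in (1, 2, 4):  # illustrative request rates (requests per second)
    # Print each command; to execute it instead, pass the list to
    # subprocess.run(cmd, check=True) from the vllm/benchmarks/ directory.
    print(shlex.join(build_benchmark_cmd(request_rate=rate, num_prompts=rate * 10)))
```

Keeping the input and output lengths fixed while varying only the request rate makes the runs directly comparable.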

The test output is similar to the following figure. For details about the metrics, see Table 2. To measure the gain from an optimization, compare the time per output token (TPOT), which excludes the first token, before and after the optimization is applied.
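The TPOT comparison can be expressed as a percentage reduction or as a speedup factor. The following is a small sketch of that arithmetic. The function names and the sample values (50 ms and 40 ms) are hypothetical and are not measured results.

```python
def tpot_reduction_percent(baseline_ms: float, optimized_ms: float) -> float:
    """Percentage by which TPOT dropped relative to the baseline run."""
    if baseline_ms <= 0:
        raise ValueError("baseline_ms must be positive")
    return 100.0 * (baseline_ms - optimized_ms) / baseline_ms


def tpot_speedup(baseline_ms: float, optimized_ms: float) -> float:
    """How many times faster each output token is generated."""
    if optimized_ms <= 0:
        raise ValueError("optimized_ms must be positive")
    return baseline_ms / optimized_ms


# Hypothetical mean TPOT values from two runs of the same command
baseline, optimized = 50.0, 40.0
print(f"TPOT reduction: {tpot_reduction_percent(baseline, optimized):.1f}%")
print(f"Speedup: {tpot_speedup(baseline, optimized):.2f}x")
```

For a fair comparison, both runs should use the same input length, output length, number of prompts, and request rate.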

**Table 2** Test output metrics
Metric	Description
Successful requests	Number of requests that completed successfully
Benchmark duration (s)	Total duration of the benchmark run, in seconds
Total input tokens	Total number of input tokens across all requests
Total generated tokens	Total number of output tokens across all requests
Request throughput (req/s)	Number of requests completed per second
Output token throughput (tok/s)	Number of output tokens generated per second
Total Token throughput (tok/s)	Total number of tokens (input and output) processed per second
Time to First Token	Time from when a request is sent until the first output token is received (TTFT)
Time per Output Token (excl. 1st token)	Average time taken to generate each output token after the first one (TPOT)
Inter-token Latency	Time between two consecutive output tokens (ITL)
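The relationship between the latency and throughput metrics can be sketched from per-token timestamps. The code below follows the definitions in Table 2 and is not the benchmark script's own implementation. It assumes one timestamp per output token, and `latency_metrics` and `token_throughput` are hypothetical helper names.

```python
def latency_metrics(send_time, token_times):
    """Derive TTFT, TPOT, and ITL (in seconds) for a single request.

    send_time:   time at which the request was sent
    token_times: arrival time of each output token, in order
    """
    if not token_times:
        raise ValueError("token_times must contain at least one timestamp")
    ttft = token_times[0] - send_time      # time to first token
    total = token_times[-1] - send_time    # end-to-end latency
    n = len(token_times)
    # TPOT excludes the first token: remaining time over remaining tokens
    tpot = (total - ttft) / (n - 1) if n > 1 else 0.0
    # ITL: gap between each pair of consecutive tokens
    itl = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return {"ttft": ttft, "tpot": tpot, "itl": itl}


def token_throughput(total_input, total_generated, duration_s):
    """Return (output tok/s, total tok/s) over the benchmark duration."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return total_generated / duration_s, (total_input + total_generated) / duration_s


# Hypothetical timestamps in seconds: request sent at 0.0, four tokens received
metrics = latency_metrics(0.0, [0.5, 0.6, 0.7, 0.9])
print(metrics)
```

Under these definitions, TPOT equals the mean of the ITL values for a single request, while TTFT is reported separately because it is dominated by processing the input.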