Performance Test Method
Use the vllm/benchmarks/benchmark_serving.py script to test serving throughput and latency.
Performance Test Example
- Start the vLLM service from any directory. Add any required optimization options to the command before starting the service.
vllm serve /home/models/DeepSeek-R1-Distill-Llama-70B/ --tensor_parallel_size=8
- Go to the vllm/benchmarks/ directory where the benchmark_serving.py test file is stored.
- Install the dependencies, then run the test. Table 1 describes the command options.
pip install datasets pandas
python benchmark_serving.py --model /home/models/DeepSeek-R1-Distill-Llama-70B/ --dataset_name random --random-input-len 2048 --random-output-len 2048 --trust-remote-code --ignore-eos --num-prompts 1 --request_rate 1
Table 1 Performance test command options

| Option | Description |
| --- | --- |
| --model /home/models/DeepSeek-R1-Distill-Llama-70B/ | Specifies the model path. |
| --dataset_name random | Uses a randomly generated dataset. |
| --random-input-len 2048 | Sets the input length to 2,048 tokens. |
| --random-output-len 2048 | Sets the output length to 2,048 tokens. |
| --ignore-eos | Ignores the end-of-sequence token, so generation continues until the specified output length is reached. |
| --num-prompts 1 | Sends one prompt during the test. |
| --request_rate 1 | Sets the request rate to 1 request per second. |
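To repeat the test at several request rates, the example command can be assembled programmatically. The following is a minimal sketch and not part of vLLM: the helper names build_benchmark_command and sweep_commands, the rates 1, 2, and 4, and the prompt count of 16 are assumptions chosen for illustration. Only flags that appear in the example command are used.

```python
import shlex

# Options copied from the example command.
# A value of None marks a flag that takes no argument.
BENCHMARK_OPTIONS = {
    "--model": "/home/models/DeepSeek-R1-Distill-Llama-70B/",
    "--dataset_name": "random",
    "--random-input-len": 2048,
    "--random-output-len": 2048,
    "--trust-remote-code": None,
    "--ignore-eos": None,
    "--num-prompts": 1,
    "--request_rate": 1,
}


def build_benchmark_command(options, script="benchmark_serving.py"):
    """Return the argv list for one benchmark run (hypothetical helper)."""
    argv = ["python", script]
    for flag, value in options.items():
        argv.append(flag)
        if value is not None:
            argv.append(str(value))
    return argv


def sweep_commands(base_options, request_rates, prompts_per_run):
    """Yield one argv list per request rate without modifying base_options."""
    for rate in request_rates:
        options = dict(base_options)
        options["--request_rate"] = rate
        options["--num-prompts"] = prompts_per_run
        yield build_benchmark_command(options)


if __name__ == "__main__":
    for argv in sweep_commands(BENCHMARK_OPTIONS, [1, 2, 4], prompts_per_run=16):
        # Print the command instead of executing it. To run it, call
        # subprocess.run(argv, check=True) from the vllm/benchmarks/ directory.
        print(shlex.join(argv))
```

Printing the commands first makes it easy to check them against Table 1 before starting a long run.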
The test output is similar to the following figure. For details about the metrics, see Table 2. To measure the gain from an optimization, compare the time per output token (TPOT), which excludes the first token, before and after enabling the optimization.

Table 2 Test output metrics

| Metric | Description |
| --- | --- |
| Successful requests | Number of completed requests |
| Benchmark duration (s) | Duration of the benchmark test |
| Total input tokens | Total number of input tokens in all requests |
| Total generated tokens | Total number of output tokens for all requests |
| Request throughput (req/s) | Number of requests processed per second |
| Output token throughput (tok/s) | Number of output tokens generated per second |
| Total Token throughput (tok/s) | Total number of tokens (input and output) processed per second |
| Time to First Token | Duration from the time when a request is sent to the time when the first output token is received |
| Time per Output Token (excl. 1st token) | Time per output token excluding the first token |
| Inter-token Latency | Duration between two consecutive tokens |
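
The metric definitions in Table 2 can be illustrated with a small calculation. The sketch below is an illustration under stated assumptions, not the code that benchmark_serving.py runs: the RequestTrace record and all function names are hypothetical, and TPOT is taken here as the time between the first and last output token divided by the number of tokens after the first.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestTrace:
    """Hypothetical per-request record; times are seconds since the request was sent."""
    input_tokens: int
    token_times: list  # arrival time of each output token, in order


def ttft(trace):
    """Time to first token: delay until the first output token arrives."""
    return trace.token_times[0]


def tpot(trace):
    """Time per output token, excluding the first token."""
    n = len(trace.token_times)
    if n < 2:
        return 0.0
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


def inter_token_latencies(trace):
    """Gaps between consecutive output tokens."""
    t = trace.token_times
    return [b - a for a, b in zip(t, t[1:])]


def summarize(traces, duration_s):
    """Aggregate the throughput and latency metrics over all completed requests."""
    total_in = sum(t.input_tokens for t in traces)
    total_out = sum(len(t.token_times) for t in traces)
    return {
        "successful_requests": len(traces),
        "request_throughput": len(traces) / duration_s,
        "output_token_throughput": total_out / duration_s,
        "total_token_throughput": (total_in + total_out) / duration_s,
        "mean_ttft_s": mean(ttft(t) for t in traces),
        "mean_tpot_s": mean(tpot(t) for t in traces),
    }


def tpot_improvement(baseline_tpot, optimized_tpot):
    """Relative TPOT reduction; a positive value means the optimized run is faster."""
    return (baseline_tpot - optimized_tpot) / baseline_tpot


if __name__ == "__main__":
    # Made-up trace: 8 input tokens, 4 output tokens arriving 0.25 s apart.
    trace = RequestTrace(input_tokens=8, token_times=[0.5, 0.75, 1.0, 1.25])
    print(summarize([trace], duration_s=2.0))
    print(tpot_improvement(baseline_tpot=0.5, optimized_tpot=0.25))
```

Because TPOT excludes the first token, it isolates decoding speed from prefill time, which is why it suits before-and-after comparisons of an optimization.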