Performance Test Method

Use the vllm/benchmarks/benchmark_serving.py script to measure the throughput and latency of a running vLLM service.

Performance Test Example

  1. Start the vLLM service from any directory. Add any required optimization options to the command before starting the service.
    vllm serve /home/models/DeepSeek-R1-Distill-Llama-70B/ --tensor_parallel_size=8
  2. Go to the vllm/benchmarks/ directory where the benchmark_serving.py test file is stored.
  3. Install the dependencies and run the test. Table 1 describes the command options.
    pip install datasets pandas
    python benchmark_serving.py --model /home/models/DeepSeek-R1-Distill-Llama-70B/ --dataset_name random --random-input-len 2048 --random-output-len 2048 --trust-remote-code --ignore-eos --num-prompts 1 --request_rate 1
    Table 1 Performance test command options

    | Option | Description |
    |--------|-------------|
    | --model /home/models/DeepSeek-R1-Distill-Llama-70B/ | Specifies the model path. |
    | --dataset_name random | Uses randomly generated prompts as the test dataset. |
    | --random-input-len 2048 | Sets the input length to 2,048 tokens. |
    | --random-output-len 2048 | Sets the output length to 2,048 tokens. |
    | --trust-remote-code | Allows custom code shipped with the model to be executed when the model or tokenizer is loaded. |
    | --ignore-eos | Ignores the end-of-sequence token, continuing generation until the specified output length is reached. |
    | --num-prompts 1 | Sends one prompt during the test. |
    | --request_rate 1 | Sets the request rate to 1 request per second. |
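
When the same test is repeated under several loads, it can help to generate the command lines rather than edit them by hand. The following is a minimal sketch that only reuses the options listed in Table 1. The helper name build_benchmark_command and the sweep values are hypothetical and not part of vLLM.

```python
import shlex


def build_benchmark_command(model_path, input_len=2048, output_len=2048,
                            num_prompts=1, request_rate=1):
    """Return the example benchmark command as an argument list.

    Hypothetical helper: it only assembles the options shown in Table 1.
    """
    return [
        "python", "benchmark_serving.py",
        "--model", model_path,
        "--dataset_name", "random",
        "--random-input-len", str(input_len),
        "--random-output-len", str(output_len),
        "--trust-remote-code",
        "--ignore-eos",
        "--num-prompts", str(num_prompts),
        "--request_rate", str(request_rate),
    ]


if __name__ == "__main__":
    model = "/home/models/DeepSeek-R1-Distill-Llama-70B/"
    # Assumed sweep values for illustration; choose loads that match your scenario.
    for num_prompts, rate in [(1, 1), (8, 2), (32, 4)]:
        cmd = build_benchmark_command(model, num_prompts=num_prompts, request_rate=rate)
        # Print a shell-safe command line that can be pasted into a terminal.
        print(shlex.join(cmd))
```

Run the printed commands from the vllm/benchmarks/ directory, one at a time, while the service is up.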

    The test output is similar to the following figure. For details about the metrics, see Table 2. To measure the gain from an optimization, run the test with and without the optimization and compare the time per output token (TPOT), which excludes the first token.
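
The TPOT comparison is simple arithmetic. The sketch below assumes you have copied one TPOT value in milliseconds from a baseline run and one from an optimized run. The function name and the sample numbers are hypothetical and are not measured results.

```python
def tpot_improvement(baseline_ms, optimized_ms):
    """Return (latency reduction in percent, speedup factor) between two TPOT values."""
    if baseline_ms <= 0 or optimized_ms <= 0:
        raise ValueError("TPOT values must be positive")
    # Share of the per-token time that the optimization removed.
    reduction_pct = 100.0 * (baseline_ms - optimized_ms) / baseline_ms
    # How many times faster each output token is produced.
    speedup = baseline_ms / optimized_ms
    return reduction_pct, speedup


if __name__ == "__main__":
    # Placeholder values for illustration only.
    reduction, speedup = tpot_improvement(baseline_ms=50.0, optimized_ms=40.0)
    print(f"TPOT reduced by {reduction:.1f}% ({speedup:.2f}x faster per token)")
```

Compare runs that use the same input length, output length, number of prompts, and request rate; otherwise the difference in TPOT may reflect the load rather than the optimization.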

    Table 2 Test output metrics

    | Metric | Description |
    |--------|-------------|
    | Successful requests | Number of requests that completed successfully |
    | Benchmark duration (s) | Duration of the benchmark test, in seconds |
    | Total input tokens | Total number of input tokens across all requests |
    | Total generated tokens | Total number of output tokens across all requests |
    | Request throughput (req/s) | Number of requests processed per second |
    | Output token throughput (tok/s) | Number of output tokens generated per second |
    | Total Token throughput (tok/s) | Total number of tokens (input and output) processed per second |
    | Time to First Token | Duration from the time a request is sent to the time the first output token is received |
    | Time per Output Token (excl. 1st token) | Time per output token, excluding the first token |
    | Inter-token Latency | Duration between two consecutive output tokens |
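
To make the latency definitions in Table 2 concrete, the following sketch derives them for a single request from token arrival timestamps. It is an illustration under stated assumptions, not the benchmark script's own code: it assumes one timestamp per output token, all times in seconds, and the function names are hypothetical.

```python
def request_latency_metrics(send_time, token_times):
    """Derive TTFT, TPOT, and inter-token latencies for one request.

    send_time:   time the request was sent (seconds)
    token_times: arrival time of each output token, in order (seconds)
    """
    if not token_times:
        raise ValueError("at least one output token is required")
    # Time to first token: request sent -> first output token received.
    ttft = token_times[0] - send_time
    # Inter-token latency: gap between each pair of consecutive tokens.
    itl = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    # Time per output token, excluding the first token.
    remaining = len(token_times) - 1
    tpot = (token_times[-1] - token_times[0]) / remaining if remaining else 0.0
    return {"ttft": ttft, "tpot": tpot, "itl": itl}


def output_token_throughput(total_output_tokens, duration_s):
    """Output tokens generated per second over the whole benchmark."""
    return total_output_tokens / duration_s


if __name__ == "__main__":
    # Made-up timestamps for illustration only.
    metrics = request_latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25])
    print(metrics)
    print(output_token_throughput(4, 1.25), "tok/s")
```

With a single prompt, as in the example command, TPOT and the mean inter-token latency coincide under these assumptions. They can differ when several requests are aggregated or when tokens arrive in batches.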