Starting the Service and Performing a Pressure Test
This section uses the DeepFM model as an example: obtain the model, deploy it with TF-Serving, and then run a pressure test against it with perf_analyzer from a Docker container.
Downloading a Model
- Go to the /path/to/models directory.
cd /path/to/models
- Obtain the DeepFM model.
git clone https://gitee.com/openeuler/sra_benchmark.git -b v0.1.0
- Check that the DeepFM model is present.
ls sra_benchmark/models/
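Beyond listing the directory, the model layout itself can be checked: a TensorFlow SavedModel directory contains a saved_model.pb file and a variables/ subdirectory. The helper below is an illustrative sketch, not part of the sra_benchmark repository; adjust the path to your environment.

```python
# Hypothetical sanity check for a SavedModel directory layout.
from pathlib import Path

def looks_like_saved_model(model_dir: Path) -> bool:
    """Return True if model_dir has the basic SavedModel layout."""
    return (model_dir / "saved_model.pb").is_file() and (model_dir / "variables").is_dir()

if __name__ == "__main__":
    # Adjust to match your environment:
    print(looks_like_saved_model(Path("/path/to/models/sra_benchmark/models/model_DeepFM_1730799407")))
```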
Starting the TF-Serving Service
- Go to the /path/to/tfserving/serving directory.
cd /path/to/tfserving/serving
- Set environment variables to enable XLA JIT compilation.
export TF_XLA_FLAGS="--tf_xla_auto_jit=2 --tf_xla_cpu_global_jit"
- Start the TF-Serving service. The following uses NUMA 2 as an example. For details about the parameters in the startup command, see Table 1.
numactl -m 2 -C 160-239 ./bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --port=8869 --rest_api_port=8870 --model_base_path=/path/to/models/sra_benchmark/models/model_DeepFM_1730799407 --model_name=deepfm --tensorflow_intra_op_parallelism=51 --tensorflow_inter_op_parallelism=27 --xla_cpu_compilation_enabled=true
If the service starts successfully, the log shows the gRPC server listening on port 8869 and the HTTP/REST API exported on port 8870.
numactl -m 2 -C 160-239 binds the TF-Serving service to NUMA node 2 so that the node's CPU resources are fully utilized: -m 2 allocates memory from NUMA node 2, and -C 160-239 pins the service to that node's cores (160-239).
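As a quick sanity check on the binding above, the -C range can be expanded to confirm how many logical cores it covers. A small illustrative helper (not part of numactl):

```python
def core_count(core_range: str) -> int:
    """Number of logical cores in a numactl -C style range like '160-239'."""
    start, end = (int(x) for x in core_range.split("-"))
    return end - start + 1

if __name__ == "__main__":
    # NUMA node 2 on the example machine spans cores 160-239:
    print(core_count("160-239"))  # 80 logical cores
```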
Processor: new-model Kunpeng 920 (hyper-threading enabled).
Table 1 Parameters in the TF-Serving service startup command

| Parameter | Description |
| --- | --- |
| --port | gRPC service port. |
| --rest_api_port | HTTP (REST) service port. |
| --model_base_path | Model path. |
| --model_name | Model name. |
| --tensorflow_intra_op_parallelism | Intra-operator parallelism. |
| --tensorflow_inter_op_parallelism | Inter-operator parallelism. |
| --xla_cpu_compilation_enabled | Whether to enable XLA compilation on the CPU. |
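Once the service is up, its REST endpoint can be queried to confirm that the model loaded. The sketch below assumes the host, REST port (8870), and model name (deepfm) from the startup command above; fetch_status is an illustrative helper, not a TF-Serving API.

```python
# Minimal sketch: query TF-Serving's model-status REST endpoint.
import json
import urllib.request

def status_url(host: str, rest_port: int, model_name: str) -> str:
    # TF-Serving exposes model status at /v1/models/<name>.
    return f"http://{host}:{rest_port}/v1/models/{model_name}"

def fetch_status(url: str) -> dict:
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)

if __name__ == "__main__":
    url = status_url("localhost", 8870, "deepfm")
    print(json.dumps(fetch_status(url), indent=2))
```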
Pressure Test on the Client
- Use PuTTY to start another session and log in to the server as the root user.
- Create a Docker container and use perf_analyzer to perform a pressure test. Table 2 describes the parameters in the pressure test command.
docker run -it --rm --net host nvcr.io/nvidia/tritonserver:24.05-py3-sdk perf_analyzer --concurrency-range 28:28:1 -p 8561 -f perf.csv -m deepfm --service-kind tfserving -i grpc --request-distribution poisson -b 660 -u localhost:8869 --percentile 99 --input-data=random
perf_analyzer prints per-concurrency throughput and latency statistics and writes them to perf.csv.
In the command output, throughput is the number of inferences processed per second; it measures the request processing capability of the server.
If the parameters configured in this document are used (default BIOS configuration, new-model Kunpeng 920 processor, hyper-threading enabled), the measured values should be close to these reference results, with deviations within normal run-to-run fluctuation.
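Because each perf_analyzer request carries a batch of 660 samples (-b 660), the request rate and the inference throughput differ by the batch-size factor. A small illustrative calculation (the request rate below is made up, not a measured value):

```python
def inferences_per_second(requests_per_second: float, batch_size: int) -> float:
    """Each request carries batch_size samples, so scale accordingly."""
    return requests_per_second * batch_size

if __name__ == "__main__":
    # e.g. 10 requests/s at batch size 660:
    print(inferences_per_second(10.0, 660))  # 6600.0
```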
Table 2 Parameters in the pressure test command

| Parameter | Description |
| --- | --- |
| --concurrency-range | Client concurrency range. |
| -m | Model to test. |
| --service-kind | Inference framework on the server (tfserving here). |
| -i | Communication protocol (grpc here). |
| -u | Server endpoint; must match the corresponding TF-Serving port. |
| -b | Batch size per request. |
| --percentile | Latency percentile to report (99 = p99). |
| --input-data | Input data source; random generates random input data. |