Principles

This section describes the thread pool architecture that TF Serving uses for inference and explains the principles behind the feature, so that you can make well-informed configuration decisions.

Figure 1 TF Serving thread pool overview

The inference threads in TF Serving fall into two functional categories: communication threads and computing threads.

Communication threads:

  • grpcpp_sync_ser threads manage end-to-end client requests (including parsing, inference triggering, and response delivery).

Computing threads:

  • tf_Compute threads coordinate parallel tasks across operators.
  • tf_numa_-1_Eige threads execute intra-operator parallel tasks.

XLA-enabled deployments spawn additional specialized threads:

  • host_executor threads coordinate parallel tasks across XLA operators.
  • tf_XLAEigen threads execute intra-XLA operator parallel tasks.
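The thread names listed above appear as per-thread `comm` values on Linux, so you can verify which pools a running TF Serving process has created. The following is a minimal sketch that reads thread names from `/proc`; the target PID is an assumption you would replace with your TF Serving process ID.

```python
import os

def thread_names(pid):
    """Return the name (comm) of every thread in the given process (Linux only)."""
    task_dir = f"/proc/{pid}/task"
    names = []
    for tid in os.listdir(task_dir):
        with open(os.path.join(task_dir, tid, "comm")) as f:
            names.append(f.read().strip())
    return sorted(names)

# Example (PID is illustrative): against a TF Serving process you would
# expect names such as grpcpp_sync_ser, tf_Compute, and tf_numa_-1_Eige.
print(thread_names(os.getpid()))
```

Counting how many threads carry each name is a quick way to check whether your inter-op and intra-op pool sizes took effect.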

Figure 2 shows the overall inference request handling process.

Figure 2 Inference request handling process

Client requests are processed by grpcpp_sync_ser threads for parsing before triggering session-based inference execution. Parallel operator processing occurs through tf_Compute or host_executor threads, with tf_numa_-1_Eige or tf_XLAEigen threads handling intra-operator parallel computing.
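The two-level dispatch described above (an inter-op pool running independent operators, each of which shards its own work across an intra-op pool) can be illustrated with plain Python thread pools. This is a conceptual sketch, not TF Serving code; the operator names and workloads are made up.

```python
from concurrent.futures import ThreadPoolExecutor

inter_op = ThreadPoolExecutor(max_workers=2)  # like tf_Compute: parallelism across operators
intra_op = ThreadPoolExecutor(max_workers=4)  # like tf_numa_-1_Eige: parallelism within an operator

def run_op(name, chunks):
    # Each operator splits its work into chunks handled by the intra-op pool.
    parts = list(intra_op.map(sum, chunks))
    return name, sum(parts)

# Two independent "operators" with illustrative workloads.
ops = {"matmul": [range(10), range(10)], "relu": [range(5)]}
results = dict(inter_op.map(lambda kv: run_op(*kv), ops.items()))
print(results)  # -> {'matmul': 90, 'relu': 10}
```

The key point is that the two pools are separate resources: sizing one affects cross-operator concurrency, sizing the other affects how fast a single operator finishes.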

Kunpeng BoostKit improves the operator scheduling algorithm and uses batch operator scheduling. Figure 3 shows the overall inference process.

Figure 3 Inference process after optimization

Client requests are parsed by grpcpp_sync_ser threads before triggering session-based inference, with operators now running sequentially in tf_Compute threads (disabling intra-operator parallelism).

This optimization reduces interference between sessions, lowering per-session inference latency and improving TF Serving concurrency. Isolating communication threads from computing threads through thread affinity yields further gains.

The thread scheduling feature enables:

  • Operator batch scheduling (via --batch_op_scheduling) for enhanced throughput in high-concurrency scenarios
  • Synchronized XLA thread pool optimization, which schedules XLA operators synchronously on the current thread, minimizing context-switching overhead
  • Configurable thread affinity isolation (via --task_affinity_isolation) binding communication and computing threads to different CPU cores
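The core-binding idea behind `--task_affinity_isolation` can be demonstrated with the standard Linux affinity call: pin one group of threads to one set of cores and another group elsewhere. This sketch only pins the calling thread and uses illustrative core IDs; the actual flag applies the same mechanism to whole thread groups inside TF Serving.

```python
import os

# Hypothetical split: communication threads on core 0, computing threads on
# the remaining cores. Here we only pin the current thread as a demonstration.
os.sched_setaffinity(0, {0})           # 0 = calling thread; {0} = allowed core set
print(os.sched_getaffinity(0))         # -> {0}
```

Binding the two thread classes to disjoint core sets prevents gRPC I/O work from preempting compute threads mid-operator, which is the source of the latency gains noted above.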

For details about the function configuration, see Usage Description.