Principles
This section describes the thread pool architecture that TF Serving uses for inference and explains how the thread scheduling feature works, so that you can make informed configuration decisions.

The inference threads in TF Serving fall into two functional categories: communication threads and computing threads.
Communication threads:
- grpcpp_sync_ser threads manage end-to-end client requests (including parsing, inference triggering, and response delivery).
Computing threads:
- tf_Compute threads coordinate parallel tasks across operators.
- tf_numa_-1_Eige threads execute intra-operator parallel tasks.
XLA-enabled deployments spawn additional specialized threads:
- host_executor threads coordinate parallel tasks across XLA operators.
- tf_XLAEigen threads execute intra-XLA operator parallel tasks.
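To check which of these threads a running TF Serving process has actually created, the thread names can be read from /proc on Linux. The following is a minimal sketch, not part of the feature itself: the role labels and helper names (classify_thread, count_thread_roles) are illustrative assumptions, and the prefixes are simply the names listed above. Linux limits thread names to 15 characters, which is presumably why forms such as grpcpp_sync_ser and tf_numa_-1_Eige appear truncated.

```python
import os
from collections import Counter

# Thread-name prefixes taken from the lists above; the role labels are
# illustrative and not defined by TF Serving.
THREAD_ROLES = {
    "grpcpp_sync_ser": "communication",
    "tf_Compute": "computing (inter-operator)",
    "tf_numa_-1_Eige": "computing (intra-operator)",
    "host_executor": "XLA (inter-operator)",
    "tf_XLAEigen": "XLA (intra-operator)",
}


def classify_thread(name):
    """Map a thread name to its role, or "other" if it is not listed."""
    for prefix, role in THREAD_ROLES.items():
        if name.startswith(prefix):
            return role
    return "other"


def count_thread_roles(pid):
    """Count the threads of process `pid` per role by reading /proc (Linux only)."""
    counts = Counter()
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(os.path.join(task_dir, tid, "comm")) as f:
                counts[classify_thread(f.read().strip())] += 1
        except OSError:
            # The thread exited between listing and reading; skip it.
            continue
    return counts
```

Calling count_thread_roles with the PID of the serving process returns a per-role thread count, which can help confirm whether the XLA thread pools are present in a given deployment.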
Figure 2 shows the overall inference request handling process.
grpcpp_sync_ser threads parse client requests and then trigger session-based inference. Operators are processed in parallel by tf_Compute threads (or host_executor threads for XLA operators), while tf_numa_-1_Eige threads (or tf_XLAEigen threads for XLA operators) handle the parallel tasks within each operator.
Kunpeng BoostKit improves the operator scheduling algorithm by introducing batch operator scheduling. Figure 3 shows the inference process with this optimization.
As before, grpcpp_sync_ser threads parse client requests and trigger session-based inference, but operators now run sequentially in tf_Compute threads, with intra-operator parallelism disabled.
This optimization reduces cross-session interference, which lowers per-session inference latency and improves TF Serving concurrency. Thread affinity isolation between communication and computing threads provides additional gains.
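The difference between the two scheduling styles can be illustrated with a toy model. This sketch is not TF Serving's scheduler and does not reproduce the BoostKit algorithm. It only assumes that handing each operator to a thread pool separately costs one handoff per operator, while running a session's operators sequentially in one worker costs one handoff per request. All names (CountingPool, run_per_operator, run_batched) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


class CountingPool:
    """Thread pool wrapper that counts how many tasks are handed to it."""

    def __init__(self, workers):
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self.submitted = 0

    def submit(self, fn, *args):
        self.submitted += 1
        return self._pool.submit(fn, *args)

    def shutdown(self):
        self._pool.shutdown(wait=True)


def run_per_operator(pool, operators, x):
    """Per-operator model: every operator is a separate pool task."""
    for op in operators:
        x = pool.submit(op, x).result()
    return x


def run_batched(pool, operators, x):
    """Batch model: one pool task runs the whole operator chain in order."""
    def run_chain(value):
        for op in operators:
            value = op(value)
        return value
    return pool.submit(run_chain, x).result()


# Hypothetical three-operator "model" used only for the demonstration.
operators = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]

per_op_pool = CountingPool(workers=4)
per_op_result = run_per_operator(per_op_pool, operators, 5)
per_op_pool.shutdown()

batched_pool = CountingPool(workers=4)
batched_result = run_batched(batched_pool, operators, 5)
batched_pool.shutdown()

print(per_op_result, per_op_pool.submitted)    # 9 3
print(batched_result, batched_pool.submitted)  # 9 1
```

Both styles compute the same result, but the batched style hands work to the pool once per request. With many concurrent sessions, fewer handoffs means sessions compete less for the same worker threads, which is one plausible reading of the reduced cross-session interference described above.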
The thread scheduling feature provides:
- Operator batch scheduling (via --batch_op_scheduling) for enhanced throughput in high-concurrency scenarios
- Synchronized XLA thread pool optimization, which schedules XLA operators onto the current thread to minimize context switching costs
- Configurable thread affinity isolation (via --task_affinity_isolation) binding communication and computing threads to different CPU cores
For details about how to configure the feature, see Usage Description.
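The idea behind thread affinity isolation can be approximated manually on Linux. The sketch below assumes a simple policy: reserve the first few cores for communication threads and give the rest to computing threads. This section does not describe how --task_affinity_isolation actually allocates cores, so the split policy and the helper names (split_cores, pin_threads) are assumptions made for illustration only.

```python
import os

# Computing thread names as listed earlier in this section.
COMPUTE_PREFIXES = ("tf_Compute", "tf_numa_-1_Eige", "host_executor", "tf_XLAEigen")


def split_cores(cores, comm_count):
    """Split a core set into disjoint communication and computing sets."""
    ordered = sorted(cores)
    if not 0 < comm_count < len(ordered):
        raise ValueError("comm_count must leave at least one core per group")
    return set(ordered[:comm_count]), set(ordered[comm_count:])


def pin_threads(pid, comm_cores, compute_cores):
    """Pin each thread of `pid` by name (Linux only; needs permission on `pid`)."""
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(os.path.join(task_dir, tid, "comm")) as f:
                name = f.read().strip()
            if name.startswith("grpcpp_sync_ser"):
                os.sched_setaffinity(int(tid), comm_cores)
            elif name.startswith(COMPUTE_PREFIXES):
                os.sched_setaffinity(int(tid), compute_cores)
        except OSError:
            continue  # thread exited or the affinity change was not permitted


comm, compute = split_cores(range(8), comm_count=2)
print(sorted(comm), sorted(compute))  # [0, 1] [2, 3, 4, 5, 6, 7]
```

Because the two core sets are disjoint, request parsing and response delivery never compete with operator execution for the same core. Note that pin_threads affects only the threads that exist when it runs, whereas a built-in option can also cover threads created later.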

