Usage Description

The thread scheduling optimization feature of Kunpeng BoostKit provides two feature switches: batch operator scheduling and thread affinity isolation. You can configure the switches as required.

For details about how to use TF Serving to start the inference pressure test, see Starting the Service and Performing a Pressure Test in the TensorFlow Serving Porting Guide.

Batch Operator Scheduling

TF Serving Command Interface

--batch_op_scheduling

Function

Enables the operator scheduling optimization and XLA thread pool management optimization features.

Parameter Type

bool

Value Range

true or false. Set to true to enable the feature or false to disable it.

Recommended Scenario

Recommended when single-core inference latency already meets requirements. Enabling this option increases concurrent processing capability and overall throughput.

Recommended Configuration

--tensorflow_intra_op_parallelism=1: Sets the intra-operator parallelism degree to 1.
--tensorflow_inter_op_parallelism=80: Sets the inter-operator parallelism degree to the number of CPU cores (80 in this example).
--batch_op_scheduling=true: Enables the batch operator scheduling feature.

Example

/path/to/tensorflow_model_server  --port=8850 --rest_api_port=8851 --model_base_path=/path/to/saved_model/ --model_name=model --tensorflow_intra_op_parallelism=1 --tensorflow_inter_op_parallelism=80 --batch_op_scheduling=true
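The recommended configuration ties the inter-operator parallelism degree to the core count, so the flag set can be derived rather than hard-coded. The following is a minimal sketch under that assumption. build_batch_flags is a hypothetical helper name and is not part of TF Serving; only the three flags it prints come from the recommended configuration above.

```shell
# Sketch: print the recommended flags for batch operator scheduling.
# build_batch_flags is a hypothetical helper, not a TF Serving command.
# Argument: number of CPU cores available to the service.
build_batch_flags() {
    cores="$1"
    # Reject empty or non-numeric input.
    case "$cores" in
        ''|*[!0-9]*)
            echo "core count must be a positive integer" >&2
            return 1
            ;;
    esac
    # Reject zero.
    if [ "$cores" -lt 1 ]; then
        echo "core count must be a positive integer" >&2
        return 1
    fi
    printf '%s %s %s\n' \
        "--tensorflow_intra_op_parallelism=1" \
        "--tensorflow_inter_op_parallelism=${cores}" \
        "--batch_op_scheduling=true"
}
```

For example, build_batch_flags "$(nproc)" prints the flags for the cores visible to the current process, and the output can be appended to the tensorflow_model_server command line.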

Thread Affinity Isolation

TF Serving Command Interface

--task_affinity_isolation

Function

Enables the thread affinity isolation feature, which offers two isolation methods.

Sequential core binding allocates TensorFlow computing threads to the first k cores in the bound range and TF Serving communication threads to the remaining cores.
Interleaved core binding assigns TensorFlow threads to physical cores and TF Serving threads to virtual cores (recommended when hyper-threading is enabled).

Parameter Type

std::string

Parameter Format

mode;m-n;k. The default value is 0 (thread affinity disabled).

Value Range

For details, see Table 1.

Recommended Scenario

When the default TensorFlow scheduling is used (batch operator scheduling is not enabled), sequential core binding is recommended.
When both batch operator scheduling and thread affinity isolation are used, interleaved core binding is recommended.

Example

A server has four NUMA nodes. Each node contains 40 physical cores (160 in total), or 80 logical cores (320 in total) when hyper-threading is enabled.

For TensorFlow scheduling mode, use these reference parameters:

numactl -C 0-79 -m 0 /path/to/tensorflow_model_server  --port=8850 --rest_api_port=8851 --model_base_path=/path/to/saved_model/ --model_name=model --tensorflow_intra_op_parallelism=75 --tensorflow_inter_op_parallelism=75 --task_affinity_isolation="1;0-79;75"

With --batch_op_scheduling enabled, set --tensorflow_inter_op_parallelism to match the physical core count. Reference parameters:

numactl -C 0-79 -m 0 /path/to/tensorflow_model_server  --port=8850 --rest_api_port=8851 --model_base_path=/path/to/saved_model/ --model_name=model --tensorflow_intra_op_parallelism=1 --tensorflow_inter_op_parallelism=40 --batch_op_scheduling=true --task_affinity_isolation="2;0-79"

**Table 1** Thread affinity isolation parameter values
Parameter	Value Range	Description	Constraints
mode	0, 1, 2	0 (OFF): Thread affinity is disabled. 1 (ORDER): Cores are bound in sequence. 2 (INTERVAL): Cores are bound in an interleaved manner.	When mode is set to 0, m-n and k are ignored and can be omitted.
m-n	Available CPU cores	The core binding range is [m, n].	m ≤ n
k	Available CPU cores	Number of cores allocated to TensorFlow threads.	k must not exceed the total number of bound cores (n - m + 1). When mode is set to 2, k is ignored and can be omitted.
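The constraints in Table 1 can be checked before the service is started. The following is a minimal sketch, not part of TF Serving: validate_affinity and is_uint are hypothetical helper names, and the sketch assumes that k is mandatory when mode is 1, which Table 1 does not state explicitly.

```shell
# Sketch: check a --task_affinity_isolation value against Table 1.
# validate_affinity and is_uint are hypothetical helpers.

# is_uint VALUE: succeeds if VALUE is a non-empty string of digits.
is_uint() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# validate_affinity "mode;m-n;k": returns 0 if the value is acceptable.
validate_affinity() {
    spec="$1"
    mode="${spec%%;*}"
    case "$mode" in
        0) return 0 ;;   # OFF: m-n and k are ignored
        1|2) ;;
        *) echo "mode must be 0, 1, or 2" >&2; return 1 ;;
    esac

    # Drop "mode;" and split the remainder into the range and optional k.
    case "$spec" in
        *';'*) rest="${spec#*;}" ;;
        *) rest="" ;;
    esac
    range="${rest%%;*}"
    case "$rest" in
        *';'*) k="${rest#*;}" ;;
        *) k="" ;;
    esac

    # Split the range m-n and enforce m <= n.
    m="${range%%-*}"
    case "$range" in
        *-*) n="${range#*-}" ;;
        *) n="" ;;
    esac
    if ! is_uint "$m" || ! is_uint "$n" || [ "$m" -gt "$n" ]; then
        echo "range must be m-n with m <= n" >&2
        return 1
    fi

    # Mode 2 (INTERVAL) ignores k, so nothing more to check.
    if [ "$mode" = 2 ]; then
        return 0
    fi

    # Mode 1 (ORDER): k is assumed to be required and at most n - m + 1.
    if ! is_uint "$k" || [ "$k" -gt $((n - m + 1)) ]; then
        echo "k must be an integer no larger than n - m + 1" >&2
        return 1
    fi
    return 0
}
```

Both values used in the examples above, "1;0-79;75" and "2;0-79", pass this check, whereas "1;0-79;81" is rejected because 81 exceeds the 80 bound cores.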

numactl is a Linux tool that controls the NUMA scheduling and memory placement policy of processes. It can be installed using Yum.

yum install -y numactl numactl-devel

For example, numactl -C 0-79 -m 0 runs the TF Serving service on cores 0 to 79 and allocates its memory from NUMA node 0, so that the service stays on NUMA node 0 and the CPU resources of that node can be fully utilized. -C specifies the cores to run on, and -m specifies the NUMA node to allocate memory from.
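The range 0-79 is correct only if those cores actually belong to NUMA node 0 on the target server. The following sketch lists the cores of a given node from the parsable output of lscpu. cpus_on_node is a hypothetical helper name, and the sketch assumes input in the format produced by lscpu -p=CPU,NODE (comment lines starting with #, followed by one CPU,NODE pair per line).

```shell
# Sketch: list the CPUs that belong to one NUMA node.
# cpus_on_node is a hypothetical helper. It reads "CPU,NODE" lines
# (the format of `lscpu -p=CPU,NODE`) from standard input and prints
# the matching CPUs as a comma-separated list.
cpus_on_node() {
    awk -F, -v node="$1" '
        /^#/ { next }                      # skip lscpu header comments
        $2 == node {
            # Append the CPU number, adding a comma after the first entry.
            list = list (list == "" ? "" : ",") $1
        }
        END { print list }
    '
}

# Example (not run here):
#   lscpu -p=CPU,NODE | cpus_on_node 0
```

The resulting list is in a form that numactl -C accepts. Running numactl --hardware also shows the cores and memory of each node.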
