Big Data Compute Engine Parameters

Flink

Flink can run in standalone or YARN mode; the example test scenario uses YARN mode. The two modes differ in resource management, task scheduling, and concurrency control, so their performance characteristics differ as well. In production, choose the mode that matches your service requirements and cluster management policies, and tune performance according to the practices for that mode.

Table 1 Flink service parameters

| Parameter Name | Parameter Description | Value Range | Test Case | Importance Ratio (%) | Optimal Value (vs. Baseline) |
| --- | --- | --- | --- | --- | --- |
| state.backend.rocksdb.block.cache-size | RocksDB data block cache size | [2, 512] | Latency | 10.8 | 227 (+3.2%) |
| execution.checkpointing.interval | Interval for triggering Flink job state checkpoints | [10, 600000] | Latency | 6.4 | 489870 |
| taskmanager.network.request-backoff.max | Maximum retry interval after a TaskManager network request failure | [1000, 20000] | Latency | 6.1 | 15256 |
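As a sketch only, the optima from Table 1 could be applied per job through the Flink CLI's generic `-D` options (available in Flink 1.11 and later). The unit of the block cache size is assumed to be MB and the two intervals are assumed to be milliseconds, since the table does not state units; the entry class and JAR name are hypothetical.

```shell
# Sketch: submit a job in YARN mode with the Table 1 optima applied.
# Assumptions: Flink 1.11+ CLI; block cache size in MB; intervals in ms.
flink run -t yarn-per-job \
  -Dstate.backend.rocksdb.block.cache-size=227mb \
  -Dexecution.checkpointing.interval=489870 \
  -Dtaskmanager.network.request-backoff.max=15256 \
  -c com.example.MyJob myjob.jar   # hypothetical entry class and JAR
```

Setting these per job keeps the cluster-wide flink-conf.yaml defaults untouched, which is convenient when different jobs need different checkpointing intervals.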

Hive

  • Service parameters

    To optimize task splitting and execution efficiency, set hive.auto.convert.join.noconditionaltask.size to a larger value and tez.grouping.min-size to a smaller value in the Hive startup settings. Set hive.stats.fetch.column.stats according to your service scenario and test cases, because its effect varies across test cases.

    Table 2 Hive service parameters

    | Parameter Name | Parameter Description | Value Range | Importance Ratio (%) | Optimal Value (vs. Baseline) | Worst Value (vs. Baseline) |
    | --- | --- | --- | --- | --- | --- |
    | tez.grouping.min-size | Minimum input data volume processed by a Tez task | {8000000, 16000000, 32000000, 64000000, 128000000, 256000000, 512000000, 768000000, 1024000000, 2048000000} | 33.1 | 16000000 (+30.8%) | 768000000 (-57.9%) |
    | hive.auto.convert.join.noconditionaltask.size | Maximum cumulative size of small tables for automatic broadcast | {0, 10000000, 50000000, 100000000, 250000000, 500000000, 750000000, 1000000000} | 30.8 | 1000000000 | 0 |
    | hive.stats.fetch.column.stats | Column-level statistics obtained from Hive metastore | {true, false} | 24.9 | - | - |

  • System parameters
    Table 3 Hive system parameters

    | Parameter Name | Value Range | Importance Ratio (%) | Default Value | Optimal Value (vs. Baseline) |
    | --- | --- | --- | --- | --- |
    | transparent_hugepage_mode | {madvise, never, always} | 47.1 | never | always (+11.9%) |
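As a sketch, the service parameters from Table 2 could be applied per session with `--hiveconf`, and the Table 3 system parameter switched through the kernel's transparent hugepage interface. The query file name is hypothetical, and the sysfs path is the standard Linux knob corresponding to the `transparent_hugepage_mode` parameter in Table 3.

```shell
# Sketch: run a Hive job with the Table 2 optima for this session only.
# hive.stats.fetch.column.stats is left at its default, since its
# optimal value depends on the workload (see Table 2).
hive \
  --hiveconf hive.auto.convert.join.noconditionaltask.size=1000000000 \
  --hiveconf tez.grouping.min-size=16000000 \
  -f query.sql   # hypothetical query file

# Table 3: set transparent hugepages to "always" on each node.
# Requires root; does not persist across reboots.
echo always > /sys/kernel/mm/transparent_hugepage/enabled
```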

Spark

A Spark cluster can be deployed in YARN or standalone mode. The example test uses the YARN mode.

  • Service parameters

    To improve parallelism and task scheduling efficiency, it is recommended to set spark.executor.instances and spark.sql.shuffle.partitions to larger values. For details about the recommended settings of other parameters, see Table 4.

    Table 4 Spark service parameters

    | Parameter Name | Parameter Description | Value Range | Importance Ratio (%) | Optimal Value (vs. Baseline) | Worst Value (vs. Baseline) |
    | --- | --- | --- | --- | --- | --- |
    | spark.executor.instances | Number of executors that execute tasks | [2, 48] | 29.6 | 45 (+28.7%) | 2 (-3.5%) |
    | spark.sql.adaptive.enabled | Adaptive query execution | {true, false} | 20.0 | false | true |
    | spark.sql.shuffle.partitions | Number of partitions generated by the Spark SQL shuffle operation | [100, 1000] | 17.0 | 775 | 259 |
    | spark.sql.autoBroadcastJoinThreshold | Maximum size of a small table that can be automatically broadcast in Spark SQL | {0, 10485760, 52428800, 104857600, 209715200, 314572800, 524288000, 1048576000} | 14.7 | 0 | 1048576000 |
    | spark.executor.memory | Size of memory available for each Spark Executor process | [2, 32] | 8.6 | 4 | 30 |
    | spark.default.parallelism | Default number of partitions in a Resilient Distributed Dataset (RDD) | [200, 1600] | 7.1 | 1455 | 200 |

  • System parameters
    Table 5 Spark system parameters

    | Parameter Name | Value Range | Importance Ratio (%) | Default Value | Optimal Value (vs. Baseline) |
    | --- | --- | --- | --- | --- |
    | transparent_hugepage_mode | {madvise, never, always} | 20.5 | never | always (+4.2%) |
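As a sketch, the Table 4 optima could be passed to a YARN-mode job via `spark-submit --conf`. The executor memory unit is assumed to be GB based on the [2, 32] range, and the application class and JAR name are hypothetical. Note that the values below reproduce the table's test-specific optima (including `spark.sql.adaptive.enabled=false`); for other workloads, adaptive query execution is often beneficial and should be re-evaluated.

```shell
# Sketch: submit a Spark application in YARN mode with the Table 4 optima.
# Assumption: spark.executor.memory range [2, 32] is in GB.
spark-submit \
  --master yarn \
  --conf spark.executor.instances=45 \
  --conf spark.executor.memory=4g \
  --conf spark.sql.shuffle.partitions=775 \
  --conf spark.default.parallelism=1455 \
  --conf spark.sql.autoBroadcastJoinThreshold=0 \
  --conf spark.sql.adaptive.enabled=false \
  --class com.example.MyApp myapp.jar   # hypothetical class and JAR
```

For the Table 5 system parameter, transparent hugepages are switched the same way as in the Hive section, on each node of the cluster.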