Big Data Compute Engine Parameters
Flink
Flink can run in standalone or YARN mode; the example test scenario uses YARN mode. The two modes differ in resource management, task scheduling, and concurrency control, so the same settings can produce different performance results. In a production environment, choose the mode that fits your service requirements and cluster management policies, then tune performance following the practices for that mode.
| Recommended Parameter | | | Tuning Analysis | | |
|---|---|---|---|---|---|
| Parameter Name | Parameter Description | Value Range | Test Case | Importance Ratio (%) | Optimal Value (vs. Baseline) |
| state.backend.rocksdb.block.cache-size | RocksDB data block cache size | [2, 512] | Latency | 10.8% | 227 (+3.2%) |
| execution.checkpointing.interval | Interval for triggering Flink job state checkpoints | [10, 600000] | | 6.4% | 489870 |
| taskmanager.network.request-backoff.max | Maximum retry interval after a TaskManager network request failure | [1000, 20000] | | 6.1% | 15256 |
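As an illustration of how the optimal values in the Flink table might be written into a Flink configuration file, the following is a minimal Python sketch. The table does not state units, so the `mb` and `ms` suffixes are assumptions, and `flink_conf_lines` is a hypothetical helper name rather than part of Flink.

```python
# Optimal values taken from the Flink table above.
# ASSUMPTION: the table gives no units. "mb" for the block cache size and
# "ms" for the checkpoint interval are guesses; verify them before use.
FLINK_OPTIMAL = {
    "state.backend.rocksdb.block.cache-size": "227mb",
    "execution.checkpointing.interval": "489870ms",
    "taskmanager.network.request-backoff.max": "15256",
}


def flink_conf_lines(settings):
    """Return one 'key: value' line per parameter, in insertion order."""
    return [f"{key}: {value}" for key, value in settings.items()]


if __name__ == "__main__":
    # Print lines that could be appended to a Flink configuration file.
    print("\n".join(flink_conf_lines(FLINK_OPTIMAL)))
```

Because the optimum was measured in YARN mode for a latency test case, these values are a starting point for your own tests rather than settings to copy unchanged.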
Hive
- Service parameters
To optimize task splitting and execution efficiency, set hive.auto.convert.join.noconditionaltask.size to a larger value and tez.grouping.min-size to a smaller value in the Hive startup settings. Set hive.stats.fetch.column.stats based on your service scenario and test cases, because its effect varies from one test case to another.
Table 2 Hive service parameters

| Recommended Parameter | | | Tuning Analysis | | |
|---|---|---|---|---|---|
| Parameter Name | Parameter Description | Value Range | Importance Ratio (%) | Optimal Value (vs. Baseline) | Worst Value (vs. Baseline) |
| tez.grouping.min-size | Minimum input data volume processed by a Tez task | {8000000, 16000000, 32000000, 64000000, 128000000, 256000000, 512000000, 768000000, 1024000000, 2048000000} | 33.1% | 16000000 (+30.8%) | 768000000 (-57.9%) |
| hive.auto.convert.join.noconditionaltask.size | Maximum cumulative size of small tables for automatic broadcast | {0, 10000000, 50000000, 100000000, 250000000, 500000000, 750000000, 1000000000} | 30.8% | 1000000000 | 0 |
| hive.stats.fetch.column.stats | Column-level statistics obtained from Hive metastore | {true, false} | 24.9% | - | - |
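The Hive service parameters above could be passed at startup in several ways. The following Python sketch builds a command line that uses the Hive CLI's `--hiveconf` option. The helper name `hive_startup_args` is hypothetical, and passing the values on the command line (rather than, for example, through a configuration file) is an assumption about what "startup settings" means here.

```python
# Optimal values copied from Table 2. hive.stats.fetch.column.stats is
# omitted because the table lists no single best value for it.
HIVE_OPTIMAL = {
    "tez.grouping.min-size": 16000000,
    "hive.auto.convert.join.noconditionaltask.size": 1000000000,
}


def hive_startup_args(settings):
    """Build an argument list: hive --hiveconf name=value ..."""
    args = ["hive"]
    for name, value in settings.items():
        args += ["--hiveconf", f"{name}={value}"]
    return args


if __name__ == "__main__":
    # Show the command as a single shell-style line.
    print(" ".join(hive_startup_args(HIVE_OPTIMAL)))
```

Add hive.stats.fetch.column.stats to the dictionary only after testing both `true` and `false` against your own workload.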
- System parameters
Table 3 Hive system parameters

| Recommended Parameter | | Tuning Analysis | | |
|---|---|---|---|---|
| Parameter Name | Value Range | Importance Ratio (%) | Default Value | Optimal Value (vs. Baseline) |
| transparent_hugepage_mode | {madvise, never, always} | 47.1% | never | always (+11.9%) |
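Before changing the transparent huge page mode, it can help to check which mode is currently active. The sketch below assumes that transparent_hugepage_mode corresponds to the Linux sysfs file `/sys/kernel/mm/transparent_hugepage/enabled`, where the kernel marks the active mode with square brackets. That mapping is an assumption, not something the table states, and the function names are hypothetical.

```python
import re
from pathlib import Path

# ASSUMPTION: transparent_hugepage_mode maps to this sysfs file.
THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def parse_thp_mode(text):
    """Return the active mode, which the kernel wraps in square brackets."""
    match = re.search(r"\[(\w+)\]", text)
    if match is None:
        raise ValueError(f"no active mode found in {text!r}")
    return match.group(1)


def current_thp_mode(path=THP_ENABLED):
    """Read the sysfs file and return the active mode."""
    return parse_thp_mode(path.read_text())


if __name__ == "__main__":
    # The file exists only on Linux kernels built with THP support.
    if THP_ENABLED.exists():
        print(current_thp_mode())
```

Parsing is kept separate from file access so that the logic can be checked on machines where the sysfs file is absent.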
Spark
A Spark cluster can be deployed in YARN or standalone mode. The example test uses YARN mode.
- Service parameters
To improve parallelism and task scheduling efficiency, set spark.executor.instances and spark.sql.shuffle.partitions to larger values. For the recommended settings of the other parameters, see Table 4.
Table 4 Spark service parameters

| Recommended Parameter | | | Tuning Analysis | | |
|---|---|---|---|---|---|
| Parameter Name | Parameter Description | Value Range | Importance Ratio (%) | Optimal Value (vs. Baseline) | Worst Value (vs. Baseline) |
| spark.executor.instances | Number of executors that execute tasks | [2, 48] | 29.6% | 45 (+28.7%) | 2 (-3.5%) |
| spark.sql.adaptive.enabled | Adaptive query execution | {true, false} | 20.0% | false | true |
| spark.sql.shuffle.partitions | Number of partitions generated by the Spark SQL shuffle operation | [100, 1000] | 17.0% | 775 | 259 |
| spark.sql.autoBroadcastJoinThreshold | Maximum size of a small table that can be automatically broadcast in Spark SQL | {0, 10485760, 52428800, 104857600, 209715200, 314572800, 524288000, 1048576000} | 14.7% | 0 | 1048576000 |
| spark.executor.memory | Size of memory available for each Spark Executor process | [2, 32] | 8.6% | 4 | 30 |
| spark.default.parallelism | Default number of partitions in a Resilient Distributed Dataset (RDD) | [200, 1600] | 7.1% | 1455 | 200 |
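To show how the optimal values in Table 4 might be applied to a job on YARN, the following Python sketch assembles a `spark-submit` command with one `--conf` flag per parameter. The application name `your_job.py` and the helper `spark_submit_command` are placeholders. The `g` suffix on spark.executor.memory is an assumption, because the table does not state a unit.

```python
import shlex

# Optimal values copied from Table 4.
SPARK_OPTIMAL = {
    "spark.executor.instances": "45",
    "spark.sql.adaptive.enabled": "false",
    "spark.sql.shuffle.partitions": "775",
    "spark.sql.autoBroadcastJoinThreshold": "0",
    # ASSUMPTION: the table value 4 is in gigabytes; the unit is not stated.
    "spark.executor.memory": "4g",
    "spark.default.parallelism": "1455",
}


def spark_submit_command(conf, app="your_job.py"):
    """Return the argument list for spark-submit on YARN."""
    args = ["spark-submit", "--master", "yarn"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app)  # placeholder application; replace with your own
    return args


if __name__ == "__main__":
    # shlex.join quotes arguments so the line is safe to paste into a shell.
    print(shlex.join(spark_submit_command(SPARK_OPTIMAL)))
```

Returning a list rather than a string keeps the command usable with `subprocess.run` without extra shell quoting.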
- System parameters
Table 5 Spark system parameters

| Recommended Parameter | | Tuning Analysis | | |
|---|---|---|---|---|
| Parameter Name | Value Range | Importance Ratio (%) | Default Value | Optimal Value (vs. Baseline) |
| transparent_hugepage_mode | {madvise, never, always} | 20.5% | never | always (+4.2%) |