Big Data Compute Engine Parameters

Flink

Flink can run in standalone or YARN mode; the example test scenario uses YARN mode. The two modes differ in resource management, task scheduling, and concurrency control, so their performance characteristics differ as well. In production, choose the mode that matches your service requirements and cluster management policies, and tune performance according to the practices for that mode.

Table 1 Flink service parameters

| Parameter Name | Parameter Description | Value Range | Test Case | Importance Ratio (%) | Optimal Value (vs. Baseline) |
| --- | --- | --- | --- | --- | --- |
| state.backend.rocksdb.block.cache-size | RocksDB data block cache size | [2, 512] | Latency | 10.8 | 227 (+3.2%) |
| execution.checkpointing.interval | Interval for triggering Flink job state checkpoints | [10, 600000] | Latency | 6.4 | 489870 |
| taskmanager.network.request-backoff.max | Maximum retry interval after a TaskManager network request failure | [1000, 20000] | Latency | 6.1 | 15256 |
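As a sketch only, the optima from Table 1 could be applied per job through the Flink CLI's generic `-D` options (available in Flink 1.11 and later). The unit of the block cache size is assumed to be MB and the two intervals are assumed to be milliseconds, since the table does not state units; the entry class and JAR name are hypothetical.

```shell
# Sketch: submit a job in YARN mode with the Table 1 optima applied.
# Assumptions: Flink 1.11+ CLI; block cache size in MB; intervals in ms.
flink run -t yarn-per-job \
  -Dstate.backend.rocksdb.block.cache-size=227mb \
  -Dexecution.checkpointing.interval=489870 \
  -Dtaskmanager.network.request-backoff.max=15256 \
  -c com.example.MyJob myjob.jar   # hypothetical entry class and JAR
```

Setting these per job keeps the cluster-wide flink-conf.yaml defaults untouched, which is convenient when different jobs need different checkpointing intervals.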

Hive

  • Service parameters

    To optimize task splitting and execution efficiency, set hive.auto.convert.join.noconditionaltask.size to a larger value and tez.grouping.min-size to a smaller value in the Hive startup settings. Set hive.stats.fetch.column.stats according to your service scenario and test cases, because its effect varies across test cases.

    Table 2 Hive service parameters

    | Parameter Name | Parameter Description | Value Range | Importance Ratio (%) | Optimal Value (vs. Baseline) | Worst Value (vs. Baseline) |
    | --- | --- | --- | --- | --- | --- |
    | tez.grouping.min-size | Minimum input data volume processed by a Tez task | {8000000, 16000000, 32000000, 64000000, 128000000, 256000000, 512000000, 768000000, 1024000000, 2048000000} | 33.1 | 16000000 (+30.8%) | 768000000 (-57.9%) |
    | hive.auto.convert.join.noconditionaltask.size | Maximum cumulative size of small tables for automatic broadcast | {0, 10000000, 50000000, 100000000, 250000000, 500000000, 750000000, 1000000000} | 30.8 | 1000000000 | 0 |
    | hive.stats.fetch.column.stats | Column-level statistics obtained from Hive metastore | {true, false} | 24.9 | - | - |

  • System parameters
    Table 3 Hive system parameters

    | Parameter Name | Value Range | Importance Ratio (%) | Default Value | Optimal Value (vs. Baseline) |
    | --- | --- | --- | --- | --- |
    | transparent_hugepage_mode | {madvise, never, always} | 47.1 | never | always (+11.9%) |
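As a sketch, the service parameters from Table 2 could be applied per session with `--hiveconf`, and the Table 3 system parameter switched through the kernel's transparent hugepage interface. The query file name is hypothetical, and the sysfs path is the standard Linux knob corresponding to the `transparent_hugepage_mode` parameter in Table 3.

```shell
# Sketch: run a Hive job with the Table 2 optima for this session only.
# hive.stats.fetch.column.stats is left at its default, since its
# optimal value depends on the workload (see Table 2).
hive \
  --hiveconf hive.auto.convert.join.noconditionaltask.size=1000000000 \
  --hiveconf tez.grouping.min-size=16000000 \
  -f query.sql   # hypothetical query file

# Table 3: set transparent hugepages to "always" on each node.
# Requires root; does not persist across reboots.
echo always > /sys/kernel/mm/transparent_hugepage/enabled
```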

Spark

A Spark cluster can be deployed in YARN or standalone mode. The example test uses the YARN mode.

  • Service parameters

    To improve parallelism and task scheduling efficiency, it is recommended to set spark.executor.instances and spark.sql.shuffle.partitions to larger values. For details about the recommended settings of other parameters, see Table 4.

    Table 4 Spark service parameters

    | Parameter Name | Parameter Description | Value Range | Importance Ratio (%) | Optimal Value (vs. Baseline) | Worst Value (vs. Baseline) |
    | --- | --- | --- | --- | --- | --- |
    | spark.executor.instances | Number of executors that execute tasks | [2, 48] | 29.6 | 45 (+28.7%) | 2 (-3.5%) |
    | spark.sql.adaptive.enabled | Adaptive query execution | {true, false} | 20.0 | false | true |
    | spark.sql.shuffle.partitions | Number of partitions generated by the Spark SQL shuffle operation | [100, 1000] | 17.0 | 775 | 259 |
    | spark.sql.autoBroadcastJoinThreshold | Maximum size of a small table that can be automatically broadcast in Spark SQL | {0, 10485760, 52428800, 104857600, 209715200, 314572800, 524288000, 1048576000} | 14.7 | 0 | 1048576000 |
    | spark.executor.memory | Size of memory available for each Spark Executor process | [2, 32] | 8.6 | 4 | 30 |
    | spark.default.parallelism | Default number of partitions in a Resilient Distributed Dataset (RDD) | [200, 1600] | 7.1 | 1455 | 200 |

  • System parameters
    Table 5 Spark system parameters

    | Parameter Name | Value Range | Importance Ratio (%) | Default Value | Optimal Value (vs. Baseline) |
    | --- | --- | --- | --- | --- |
    | transparent_hugepage_mode | {madvise, never, always} | 20.5 | never | always (+4.2%) |
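As a sketch, the Table 4 optima could be passed to a YARN-mode job via `spark-submit --conf`. The executor memory unit is assumed to be GB based on the [2, 32] range, and the application class and JAR name are hypothetical. Note that the values below reproduce the table's test-specific optima (including `spark.sql.adaptive.enabled=false`); for other workloads, adaptive query execution is often beneficial and should be re-evaluated.

```shell
# Sketch: submit a Spark application in YARN mode with the Table 4 optima.
# Assumption: spark.executor.memory range [2, 32] is in GB.
spark-submit \
  --master yarn \
  --conf spark.executor.instances=45 \
  --conf spark.executor.memory=4g \
  --conf spark.sql.shuffle.partitions=775 \
  --conf spark.default.parallelism=1455 \
  --conf spark.sql.autoBroadcastJoinThreshold=0 \
  --conf spark.sql.adaptive.enabled=false \
  --class com.example.MyApp myapp.jar   # hypothetical class and JAR
```

For the Table 5 system parameter, transparent hugepages are switched the same way as in the Hive section, on each node of the cluster.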