
Tuning a Hive Task

Use the OmniAdvisor feature to obtain the optimal running parameters of Hive tasks and optimize task performance.

  1. Modify the list of parameters to be tuned, default parameter values, and parameter ranges.
    1. Open the $OMNIADVISOR_HOME/BoostKit-omniadvisor_1.1.0/config/hive/hive_config.yml file.
      vi $OMNIADVISOR_HOME/BoostKit-omniadvisor_1.1.0/config/hive/hive_config.yml
    2. Press i to enter the insert mode. Add, delete, or retain the parameters to be tuned as needed. In addition, set the parameter name, value range, default value, data type, and unit.

      As an example, Table 1 describes the configuration items of the hive.exec.reducers.max parameter.

      Table 1 Configuration items in hive_config.yml

      Configuration Item

      Description

      hive.exec.reducers.max

      Name of the Hive configuration parameter to be tuned. It specifies the maximum number of Reduce tasks that can be started during a Hive query.

      choices

      Parameter value range. During tuning, the algorithm selects a value from the range defined by choices. The range is usually centered on default_value and widened or narrowed based on the available resources.

      default_value

      Default parameter value. Set it based on your actual service requirements. The default value must be included in the range defined by choices; generally, it is the median of choices.

      type

      Data type, which is int, boolean, or float.

      unit

      Unit of the parameter value, which can be K, M, or G, indicating KB, MB, and GB respectively. Generally, GB is used by default.

      The common configuration items are as follows. The parameter values are for reference only. You can adjust the values of choices and default_value or add or delete parameters involved in the tuning process based on your service scenario and available resources.

      hive.exec.reducers.max: # Configure the maximum number of Reduce tasks that can be started during a Hive query.
        choices: [ 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100, 1200, 1300, 1400, 1500 ] # Configure the parameter value range.
        default_value: 600 # Configure the default parameter value.
        type: int # Configure the data type, which is int, boolean, or float.
      
      hive.tez.container.size: # Configure the size of a Tez container.
        choices: [ 5120, 6144, 7168, 8192, 9216, 10240, 11264, 12288, 13312, 14336, 15360, 16384, 17408, 18432, 19456, 20480 ]
        default_value: 8192
        type: int
      
      tez.runtime.io.sort.mb: # Configure the I/O sorting memory size for running Tez.
        choices: [ 32, 64, 96, 128, 160, 192, 224, 256, 288, 320, 352 ]
        default_value: 64
        type: int
      
      tez.am.resource.memory.mb: # Configure the memory size of the Tez Application Manager.
        choices: [ 3072, 4096, 5120, 6144, 7168, 8192, 9216, 10240 ]
        default_value: 3072
        type: int
      
      tez.grouping.split-count: # Configure the number of tasks generated in a Tez DAG.
        choices: [ 500, 1000, 1500, 2000, 2500, 3000 ]
        default_value: 1500
        type: int
      
      tez.grouping.min-size: # Configure the minimum size of an input split processed in each task.
        choices: [ 8000000, 16000000, 32000000, 64000000 ]
        default_value: 8000000
        type: int
      
      tez.grouping.max-size: # Configure the maximum size of an input split processed in each task.
        choices: [ 64000000, 128000000, 256000000, 512000000, 768000000, 1024000000 ]
        default_value: 1024000000
        type: int
      
      hive.stats.fetch.column.stats: # Set whether to enable the Hive query optimizer to extract column-level statistics to help optimize the query plan.
        choices: [ "true", "false" ]
        default_value: "true"
        type: boolean
      
      hive.auto.convert.join: # Set whether to enable automatic join optimization for Hive.
        choices: [ "true", "false" ]
        default_value: "true"
        type: boolean
      
      hive.optimize.skewjoin: # Set whether to enable skewed join optimization for Hive.
        choices: [ "true", "false" ]
        default_value: "true"
        type: boolean
      
      hive.exec.compress.output: # Set whether the final output needs to be compressed.
        choices: [ "true", "false" ]
        default_value: "true"
        type: boolean
      
      hive.exec.compress.intermediate: # Set whether to compress intermediate files between MapReduce jobs.
        choices: [ "true", "false" ]
        default_value: "true"
        type: boolean
      
      hive.exec.parallel.thread.number: # Configure the number of concurrent tasks.
        choices: [ 4, 8, 16, 32, 64, 96 ]
        default_value: 8
        type: int
      
      hive.auto.convert.join.noconditionaltask: # Set whether to allow converting common joins to Map joins.
        choices: [ "true", "false" ]
        default_value: "true"
        type: boolean
      
      hive.auto.convert.join.noconditionaltask.size: # Threshold (in bytes) for converting a join to a Map join when hive.auto.convert.join.noconditionaltask is enabled.
        choices: [ 10000000, 100000000, 1000000000, 2000000000, 5000000000, 6000000000 ]
        default_value: 10000000
        type: int
      
      hive.limit.optimize.enable: # Set whether to enable limit optimization.
        choices: [ "true", "false" ]
        default_value: "true"
        type: boolean
    3. Press Esc, type :wq!, and press Enter to save the file and exit.
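After editing hive_config.yml, it is worth checking that every default_value falls inside its choices list and that the choices match the declared type. The following is a minimal standalone sketch; the parameter dict is a hand-copied subset of the file (a real check would load the YAML with a parser):

```python
# Sanity-check tuning parameters: default_value must appear in choices
# and every choice must match the declared type. The dict below is a
# hand-copied subset of hive_config.yml for illustration only.
params = {
    "hive.exec.reducers.max": {
        "choices": [100, 200, 300, 400, 500, 600, 700, 800],
        "default_value": 600,
        "type": "int",
    },
    "hive.auto.convert.join": {
        "choices": ["true", "false"],
        "default_value": "true",
        "type": "boolean",
    },
}

# Booleans are written as quoted strings ("true"/"false") in the file.
TYPE_MAP = {"int": int, "float": float, "boolean": str}

def validate(params):
    errors = []
    for name, spec in params.items():
        if spec["default_value"] not in spec["choices"]:
            errors.append(f"{name}: default_value not in choices")
        expected = TYPE_MAP[spec["type"]]
        if not all(isinstance(c, expected) for c in spec["choices"]):
            errors.append(f"{name}: choices do not match declared type")
    return errors

print(validate(params))  # → [] when the configuration is consistent
```

A failed run of init_environment caused by an out-of-range default_value is easier to debug with a check like this done up front.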
  2. Confirm the default Hive conf configuration.
    1. Tune the Hive SQL task (TPC-DS SQL12):
      hive --hiveconf hive.cbo.enable=true --hiveconf hive.exec.reducers.max=600 --hiveconf hive.exec.compress.intermediate=true --hiveconf hive.tez.container.size=8192 --hiveconf tez.am.resource.memory.mb=8192 --hiveconf tez.task.resource.memory.mb=8192 --hiveconf tez.runtime.io.sort.mb=128 --hiveconf hive.merge.tezfiles=true --hiveconf tez.am.container.reuse.enabled=true --hiveconf hive.session.id=sql12 --database tpcds_bin_partitioned_decimal_orc_100 -f /home/hive-tpcds/sql12.sql
      
    2. The task configuration contains both tuned and non-tuned parameters: tuned parameters are those to be optimized by OmniAdvisor, while non-tuned parameters are those that cannot or need not be changed.
      • For the preceding Hive SQL task, the tuned parameters are:
        --hiveconf hive.exec.reducers.max=600 --hiveconf hive.exec.compress.intermediate=true --hiveconf hive.tez.container.size=8192 --hiveconf tez.am.resource.memory.mb=8192 --hiveconf tez.task.resource.memory.mb=8192 --hiveconf tez.runtime.io.sort.mb=128
        
      • The non-tuned parameters are:
        --hiveconf hive.cbo.enable=true --hiveconf tez.am.container.reuse.enabled=true --hiveconf hive.merge.tezfiles=true
        
    3. Add the non-tuned parameters to the $OMNIADVISOR_HOME/BoostKit-omniadvisor_1.1.0/config/common_config.cfg file.
      1. Open the $OMNIADVISOR_HOME/BoostKit-omniadvisor_1.1.0/config/common_config.cfg file.
        vi $OMNIADVISOR_HOME/BoostKit-omniadvisor_1.1.0/config/common_config.cfg
      2. Press i to enter the insert mode and add non-tuned parameter configuration to the hive_default_config field.
        # Default Hive parameter. Generally, the default parameter is not involved in parameter sampling.
        hive_default_config = --hiveconf hive.cbo.enable=true --hiveconf tez.am.container.reuse.enabled=true --hiveconf hive.merge.tezfiles=true
        
      3. Press Esc, type :wq!, and press Enter to save the file and exit.
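To confirm the edit took effect, the field can be read back with Python's standard configparser. A minimal sketch; the [hive] section name here is an assumption based on the section shown later in this file, so check where hive_default_config actually lives in your copy of common_config.cfg:

```python
import configparser
import os
import tempfile

# A hand-written stand-in for common_config.cfg. The [hive] section name
# is an assumption -- verify it against your actual file.
cfg_text = """\
[hive]
hive_default_config = --hiveconf hive.cbo.enable=true --hiveconf tez.am.container.reuse.enabled=true --hiveconf hive.merge.tezfiles=true
"""

with tempfile.NamedTemporaryFile("w", suffix=".cfg", delete=False) as f:
    f.write(cfg_text)
    path = f.name

parser = configparser.ConfigParser()
parser.read(path)
defaults = parser.get("hive", "hive_default_config")
print(defaults.count("--hiveconf"))  # 3 non-tuned parameters
os.unlink(path)
```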
  3. On the management node, initialize the database and synchronize the parameter configuration to the log parsing module.
    1. Use the OmniAdvisor CLI to choose the Hive engine.

      python $OMNIADVISOR_HOME/BoostKit-omniadvisor_1.1.0/main.pyc

    2. Input init_environment or press Tab to select init_environment, and then press Enter.
      • After the command is run, tables history_config and best_config are created in the test_advisor database.
      • In this step, the tuned parameters in hive_config.yml are synchronized to the configuration of the log parsing module. If the tuned Hive parameters change, run the init_environment command again to synchronize the new settings to the log parsing module.
  4. Invoke the log parsing module to write the parsed data into the database.
    1. Open the $OMNIADVISOR_HOME/BoostKit-omniadvisor_1.1.0/config/common_config.cfg configuration file.
      vi $OMNIADVISOR_HOME/BoostKit-omniadvisor_1.1.0/config/common_config.cfg
    2. Press i to enter the insert mode and modify the start time and end time of the log. For details about the common_config.cfg file, see common_config.cfg.
      [hive]
      # Start time of Tez run logs. You can view the date on the Hadoop UI.
      log_start_time = 2023-09-14 19:12:45
      # End time of Tez run logs.
      log_end_time = 2023-09-14 19:19:45
      
    3. Press Esc, type :wq!, and press Enter to save the file and exit.
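The two timestamps define the window of Tez logs to parse, so it is worth verifying that they parse in the expected format and that the start precedes the end. A small sketch (the %Y-%m-%d %H:%M:%S format matches the values shown above; confirm it against your common_config.cfg):

```python
from datetime import datetime

# Timestamp format used by log_start_time / log_end_time above.
FMT = "%Y-%m-%d %H:%M:%S"

def check_window(start_s, end_s):
    """Parse the log window and return its length in seconds; raise if invalid."""
    start = datetime.strptime(start_s, FMT)
    end = datetime.strptime(end_s, FMT)
    if end <= start:
        raise ValueError("log_end_time must be after log_start_time")
    return (end - start).total_seconds()

print(check_window("2023-09-14 19:12:45", "2023-09-14 19:19:45"))  # 420.0
```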
    4. Run the collection command.
      Use the OmniAdvisor CLI to choose the Hive engine.
      python $OMNIADVISOR_HOME/BoostKit-omniadvisor_1.1.0/main.pyc

      Input fetch_history_data or press Tab to select fetch_history_data, and then press Enter.

      After the historical task information is parsed, the result is written into the history_config and best_config tables.

  5. Sample parameters of historical tasks.
    1. Use the OmniAdvisor CLI to choose the Hive engine.

      python $OMNIADVISOR_HOME/BoostKit-omniadvisor_1.1.0/main.pyc

    2. Input parameter_sampling or press Tab to select parameter_sampling, and then press Enter.

    3. Enter a number to specify the number of parameter sampling rounds.

  6. Tune parameters.
    • Input yes to sample parameters of all tunable tasks in the database for n rounds. Wait until the sampling is complete.

    • If you input no, the listed tasks are filtered. Input the identification values of the tasks to be tuned. Use commas (,) to separate multiple task identification values. Press Enter to sample task parameters for n rounds. Wait until the sampling is complete.

    • Each time a task's parameters are sampled, the program invokes the log parsing module to parse task information such as the task status and task running time, saves the information to the history_config table, and updates the optimal configuration in the best_config table.
    • You can recommend parameters for a task only after parameter sampling is complete.
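Conceptually, each sampling round appends a record to history_config, and the best configuration is the successful run with the shortest running time. A hedged sketch of that selection logic (the record fields below are illustrative stand-ins, not the actual table schema):

```python
# Illustrative selection of the best configuration from sampled history.
# The fields (status, runtime_s, params) are hypothetical stand-ins for
# the real history_config schema.
history = [
    {"status": "SUCCEEDED", "runtime_s": 95.0,
     "params": {"hive.exec.reducers.max": 600}},
    {"status": "FAILED",    "runtime_s": 12.0,
     "params": {"hive.exec.reducers.max": 1500}},
    {"status": "SUCCEEDED", "runtime_s": 81.0,
     "params": {"hive.exec.reducers.max": 800}},
]

def best_config(history):
    """Return the parameters of the fastest successful run, or None."""
    ok = [r for r in history if r["status"] == "SUCCEEDED"]
    return min(ok, key=lambda r: r["runtime_s"])["params"] if ok else None

print(best_config(history))  # {'hive.exec.reducers.max': 800}
```

This is also why recommendation is only possible after sampling completes: until at least one successful run is recorded, there is no best configuration to match.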
  7. Execute the task with the optimal parameters obtained from sampling.
    1. Use the OmniAdvisor CLI to choose the Hive engine.

      python $OMNIADVISOR_HOME/BoostKit-omniadvisor_1.1.0/main.pyc

    2. Input parameter_recommend or press Tab to select parameter_recommend, and then press Enter.

    3. Input the command that submits the Hive task to be tuned, and submit the task through OmniAdvisor.

      The following uses sql12.sql as an example for an SQL task:

    4. Use OmniAdvisor to recommend and submit parameters.
      python $OMNIADVISOR_HOME/BoostKit-omniadvisor_1.1.0/main.pyc -e hive -i parameter_recommend -c "hive --hiveconf hive.cbo.enable=true --hiveconf hive.exec.reducers.max=600 --hiveconf hive.exec.compress.intermediate=true --hiveconf hive.tez.container.size=8192 --hiveconf tez.am.resource.memory.mb=8192 --hiveconf tez.task.resource.memory.mb=8192 --hiveconf tez.runtime.io.sort.mb=128 --hiveconf hive.merge.tezfiles=true --hiveconf tez.am.container.reuse.enabled=true --hiveconf hive.session.id=sql12 --database tpcds_bin_partitioned_decimal_orc_100 -f /home/hive_sql/sample-queries-tpcds/query12.sql"
    • During parameter recommendation, the task identification value is calculated based on identification_type in the configuration, the optimal parameters in best_config are matched to replace the original parameters, and the new parameters are submitted to Hive for execution.
    • If no optimal parameters are matched in the best_config table or the matched parameters fail to be executed, the task is executed using the original parameters.
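The last two points amount to a match-or-fallback rule: if best_config holds optimal parameters for the task's identification value, they replace the originals; otherwise, or if the matched parameters fail to execute, the original parameters are used. Sketched below with hypothetical names (the lookup table and run() callable are illustrative, not OmniAdvisor's internal API):

```python
# Match-or-fallback logic for parameter recommendation (illustrative only;
# the lookup key and the run() callable are hypothetical stand-ins).
def recommend_and_run(task_id, original_params, best_config_table, run):
    """Try the optimal parameters; fall back to the originals on any failure."""
    optimal = best_config_table.get(task_id)
    if optimal is not None:
        try:
            return run(optimal)
        except Exception:
            pass  # matched parameters failed to execute
    return run(original_params)  # no match or failed run: use originals

# Example: the matched optimal parameters fail, so the originals are used.
table = {"sql12": {"hive.exec.reducers.max": 800}}

def run(params):
    if params.get("hive.exec.reducers.max") == 800:
        raise RuntimeError("simulated failure")
    return f"ran with {params}"

print(recommend_and_run("sql12", {"hive.exec.reducers.max": 600}, table, run))
# ran with {'hive.exec.reducers.max': 600}
```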