Rate This Document
Findability
Accuracy
Completeness
Readability

Executing Hive Services

Hive uses the interactive CLI to execute SQL tasks. To check whether HiveExtension takes effect on Hive, add the EXPLAIN statement before the SQL statement. If the operator name starts with "Omni", HiveExtension has taken effect.

In this example, the tpcds_bin_partitioned_varchar_orc_2 data table is used as the test table and Q82 of the TPC-DS test dataset is used as the test SQL statement.

Table 1 lists the related table information.

Table 1 Table information

Table

Format

Total Number of Rows

item

orc

26000

inventory

orc

16966305

date_dim

orc

73049

store_sales

orc

5760749

  1. Start the Hive SQL CLI.
    • Commands for starting open source Hive SQL:
      hive --database tpcds_bin_partitioned_varchar_orc_2 --hiveconf hive.fetch.task.conversion=none --hiveconf hive.cbo.enable=true --hiveconf hive.exec.reducers.max=600 --hiveconf hive.exec.compress.intermediate=true --hiveconf hive.tez.container.size=8192 --hiveconf tez.am.resource.memory.mb=8192 --hiveconf tez.task.resource.memory.mb=8192 --hiveconf tez.runtime.io.sort.mb=128 --hiveconf hive.merge.tezfiles=true --hiveconf tez.am.container.reuse.enabled=true 
    • Run the following command to start HiveExtension:
      hive --database tpcds_bin_partitioned_varchar_orc_2 --hiveconf hive.fetch.task.conversion=none --hiveconf hive.cbo.enable=true --hiveconf hive.exec.reducers.max=600 --hiveconf hive.exec.compress.intermediate=true --hiveconf hive.tez.container.size=8192 --hiveconf tez.am.resource.memory.mb=8192 --hiveconf tez.task.resource.memory.mb=8192 --hiveconf tez.runtime.io.sort.mb=128 --hiveconf hive.merge.tezfiles=true --hiveconf tez.am.container.reuse.enabled=true --hiveconf hive.exec.pre.hooks=com.huawei.boostkit.hive.OmniExecuteWithHookContext --hiveconf tez.task.launch.env=OMNI_CONF=/opt/omni-operator/hive,LD_LIBRARY_PATH=/opt/omni-operator/lib 
    • In the preceding startup command, the OMNI_CONF path must be the same as the custom configuration file directory in Setting the Hive Configuration File.
    • If the dataset contains data of the String type and each single field of the data contains a large amount of data (more than 512 characters), you are advised to add the following configuration:
      --hiveconf omni.hive.string.length=2000
    • If a large Parquet dataset (for example, 10 TB) is used, you are advised to increase the value of hive.tez.container.size and other related parameters of the Tez task. The recommended startup command is as follows:
      hive --database tpcds_bin_partitioned_varchar_parquet_10000 --hiveconf tez.task.launch.env=OMNI_CONF=/opt/omni-operator/hive,LD_LIBRARY_PATH=/opt/omni-operator/lib --hiveconf hive.vectorized.execution.enabled=true --hiveconf hive.cbo.enable=true --hiveconf hive.exec.reducers.max=600 --hiveconf hive.exec.compress.intermediate=true --hiveconf hive.tez.container.size=61440 --hiveconf tez.am.resource.memory.mb=61440 --hiveconf tez.task.resource.memory.mb=61440 --hiveconf mapreduce.reduce.java.opts=-Xmx49152m --hiveconf mapreduce.map.java.opts=-Xmx49152m --hiveconf tez.runtime.io.sort.mb=128 --hiveconf hive.merge.tezfiles=true --hiveconf tez.am.container.reuse.enabled=true --hiveconf hive.exec.pre.hooks=com.huawei.boostkit.hive.OmniExecuteWithHookContext --hiveconf tez.container.max.java.heap.fraction=0.5
    • The preceding startup command is used in Yarn mode. If you want to use the local mode instead, add the following parameters to the end of the startup command:
      --hiveconf tez.local.mode=true --hiveconf tez.runtime.optimize.local.fetch=true

    Table 2 describes the HiveExtension startup parameters.

    Table 2 HiveExtension startup parameters

    Parameter

    Default Value

    Description

    hive.exec.pre.hooks

    com.huawei.boostkit.hive.OmniExecuteWithHookContext

    Starts HiveExtension.

    tez.task.launch.env

    OMNI_CONF=/opt/omni-operator/hive,LD_LIBRARY_PATH=/opt/omni-operator/lib

    Sets the environment of the Tez task process.

    hive.fetch.task.conversion

    more

    Indicates whether to convert Hive data queries into MapReduce tasks. more: MapReduce is executed for part of tasks; none: MapReduce is executed for all tasks.

    hive.cbo.enable

    true

    Indicates whether to enable the cost-based optimizer. true: yes; false: no.

    hive.exec.reducers.max

    1009

    Sets the maximum number of reducers.

    hive.exec.compress.intermediate

    false

    Indicates whether to compress intermediate files generated by Hive between MapReduce tasks. true: yes; false: no.

    hive.tez.container.size

    -1

    Sets the default size of a container generated by a Tez task, in MB.

    tez.am.resource.memory.mb

    1024

    Sets the memory size used by the ApplicationMaster corresponding to each Tez task in the cluster, in MB.

    tez.task.resource.memory.mb

    1024

    Sets the memory size used by started tasks in a Tez container, in MB.

    tez.runtime.io.sort.mb

    100

    Sets the size of the output sorting memory, in MB.

    hive.merge.tezfiles

    false

    Indicates whether to merge the files generated by a Tez task. true: yes; false: no.

    tez.am.container.reuse.enabled

    true

    Indicates whether to reuse containers. true: yes; false: no.

    tez.local.mode

    false

    Indicates whether to enable the local mode. true: yes; false: no.

    tez.runtime.optimize.local.fetch

    true

    Indicates whether to optimize the file logic in local mode. true: yes; false: no.

    hive.vectorized.execution.enabled

    true

    Indicates whether to enable Hive vectorization. true: yes; false: no.

    mapreduce.map.java.opts

    -Xmx200m

    Sets the maximum heap memory that can be used by the Map task.

    mapreduce.reduce.java.opts

    -Xmx200m

    Sets the maximum heap memory that can be used by the Reduce task.

    tez.container.max.java.heap.fraction

    0.8

    Sets the ratio of the heap memory available to the Tez main task process to the container memory.

    omni.hive.string.length

    -

    Sets the maximum String data length supported by HiveExtension.

    omni.hive.adaptivePartialAggregation.enabled

    false

    Enables adaptive skipping of the HashAgg group aggregation operation in the partial stage. This optimization is performed during running. The partial stage of group aggregation is skipped and data is directly output to the downstream operator if the sampling scenario is identified as a high cardinality scenario and group aggregation is performed.

    omni.hive.adaptivePartialAggregationMinRows

    500000

    Specifies the minimum number of rows sampled for adaptivePartialAggregation optimization. When this number has been reached, the tool calculates the aggregation of the sampled data.

    omni.hive.adaptivePartialAggregationRatio

    0.8

    Specifies the minimum aggregation threshold for adaptivePartialAggregation optimization. If the aggregation of sampled data has reached the threshold, this type of optimization is applied.

  2. Check whether HiveExtension takes effect.

    Run the following SQL statement in the HiveExtension CLI and open source Hive SQL CLI:

    explain select  i_item_id
        ,i_item_desc
        ,i_current_price
    from item, inventory, date_dim, store_sales
    where i_current_price between 76 and 76+30
    and inv_item_sk = i_item_sk
    and d_date_sk=inv_date_sk
    and d_date between cast('1998-06-29' as date) and cast('1998-08-29' as date)
    and i_manufact_id in (512,409,677,16)
    and inv_quantity_on_hand between 100 and 500
    and ss_item_sk = i_item_sk
    group by i_item_id,i_item_desc,i_current_price
    order by i_item_id
    limit 100;

    The following figure shows the execution plan output in the HiveExtension CLI. If the operator name starts with "Omni", HiveExtension has taken effect.

    Execution plan outputted by open source Hive SQL

  3. Run the following SQL statement.

    Run the following SQL statement in the HiveExtension CLI and open source Hive SQL CLI:

     select  i_item_id
        ,i_item_desc
        ,i_current_price
    from item, inventory, date_dim, store_sales
    where i_current_price between 76 and 76+30
    and inv_item_sk = i_item_sk
    and d_date_sk=inv_date_sk
    and d_date between cast('1998-06-29' as date) and cast('1998-08-29' as date)
    and i_manufact_id in (512,409,677,16)
    and inv_quantity_on_hand between 100 and 500
    and ss_item_sk = i_item_sk
    group by i_item_id,i_item_desc,i_current_price
    order by i_item_id
    limit 100;
  4. Compare the results.
    • Execution result of open source Hive SQL:

      The execution plan is as follows:

    • HiveExtension execution result

      The execution plan is as follows:

    • Result comparison: The SQL output data of open source Hive SQL is the same as that of HiveExtension.