Executing Hive Services
Hive uses the interactive CLI to execute SQL tasks. To check whether HiveExtension takes effect on Hive, add the EXPLAIN statement before the SQL statement. If the operator name starts with "Omni", HiveExtension has taken effect.
In this example, the tpcds_bin_partitioned_varchar_orc_2 data table is used as the test table and Q82 of the TPC-DS test dataset is used as the test SQL statement.
Table 1 lists the related table information.
|
Table |
Format |
Total Number of Rows |
|---|---|---|
|
item |
orc |
26000 |
|
inventory |
orc |
16966305 |
|
date_dim |
orc |
73049 |
|
store_sales |
orc |
5760749 |
- Start the Hive SQL CLI.
- Commands for starting open source Hive SQL:
hive --database tpcds_bin_partitioned_varchar_orc_2 --hiveconf hive.fetch.task.conversion=none --hiveconf hive.cbo.enable=true --hiveconf hive.exec.reducers.max=600 --hiveconf hive.exec.compress.intermediate=true --hiveconf hive.tez.container.size=8192 --hiveconf tez.am.resource.memory.mb=8192 --hiveconf tez.task.resource.memory.mb=8192 --hiveconf tez.runtime.io.sort.mb=128 --hiveconf hive.merge.tezfiles=true --hiveconf tez.am.container.reuse.enabled=true
- Run the following command to start HiveExtension:
hive --database tpcds_bin_partitioned_varchar_orc_2 --hiveconf hive.fetch.task.conversion=none --hiveconf hive.cbo.enable=true --hiveconf hive.exec.reducers.max=600 --hiveconf hive.exec.compress.intermediate=true --hiveconf hive.tez.container.size=8192 --hiveconf tez.am.resource.memory.mb=8192 --hiveconf tez.task.resource.memory.mb=8192 --hiveconf tez.runtime.io.sort.mb=128 --hiveconf hive.merge.tezfiles=true --hiveconf tez.am.container.reuse.enabled=true --hiveconf hive.exec.pre.hooks=com.huawei.boostkit.hive.OmniExecuteWithHookContext --hiveconf tez.task.launch.env=OMNI_CONF=/opt/omni-operator/hive,LD_LIBRARY_PATH=/opt/omni-operator/lib
- In the preceding startup command, the OMNI_CONF path must be the same as the custom configuration file directory in Setting the Hive Configuration File.
- If the dataset contains data of the String type and each single field of the data contains a large amount of data (more than 512 characters), you are advised to add the following configuration:
--hiveconf omni.hive.string.length=2000
- If a large Parquet dataset (for example, 10 TB) is used, you are advised to increase the value of hive.tez.container.size and other related parameters of the Tez task. The recommended startup command is as follows:
hive --database tpcds_bin_partitioned_varchar_parquet_10000 --hiveconf tez.task.launch.env=OMNI_CONF=/opt/omni-operator/hive,LD_LIBRARY_PATH=/opt/omni-operator/lib --hiveconf hive.vectorized.execution.enabled=true --hiveconf hive.cbo.enable=true --hiveconf hive.exec.reducers.max=600 --hiveconf hive.exec.compress.intermediate=true --hiveconf hive.tez.container.size=61440 --hiveconf tez.am.resource.memory.mb=61440 --hiveconf tez.task.resource.memory.mb=61440 --hiveconf mapreduce.reduce.java.opts=-Xmx49152m --hiveconf mapreduce.map.java.opts=-Xmx49152m --hiveconf tez.runtime.io.sort.mb=128 --hiveconf hive.merge.tezfiles=true --hiveconf tez.am.container.reuse.enabled=true --hiveconf hive.exec.pre.hooks=com.huawei.boostkit.hive.OmniExecuteWithHookContext --hiveconf tez.container.max.java.heap.fraction=0.5
- The preceding startup command is used in Yarn mode. If you want to use the local mode instead, add the following parameters to the end of the startup command:
--hiveconf tez.local.mode=true --hiveconf tez.runtime.optimize.local.fetch=true
Table 2 describes the HiveExtension startup parameters.
Table 2 HiveExtension startup parameters Parameter
Default Value
Description
hive.exec.pre.hooks
com.huawei.boostkit.hive.OmniExecuteWithHookContext
Starts HiveExtension.
tez.task.launch.env
OMNI_CONF=/opt/omni-operator/hive,LD_LIBRARY_PATH=/opt/omni-operator/lib
Sets the environment of the Tez task process.
hive.fetch.task.conversion
more
Indicates whether to convert Hive data queries into MapReduce tasks. more: MapReduce is executed for part of tasks; none: MapReduce is executed for all tasks.
hive.cbo.enable
true
Indicates whether to enable the cost-based optimizer. true: yes; false: no.
hive.exec.reducers.max
1009
Sets the maximum number of reducers.
hive.exec.compress.intermediate
false
Indicates whether to compress intermediate files generated by Hive between MapReduce tasks. true: yes; false: no.
hive.tez.container.size
-1
Sets the default size of a container generated by a Tez task, in MB.
tez.am.resource.memory.mb
1024
Sets the memory size used by the ApplicationMaster corresponding to each Tez task in the cluster, in MB.
tez.task.resource.memory.mb
1024
Sets the memory size used by started tasks in a Tez container, in MB.
tez.runtime.io.sort.mb
100
Sets the size of the output sorting memory, in MB.
hive.merge.tezfiles
false
Indicates whether to merge the files generated by a Tez task. true: yes; false: no.
tez.am.container.reuse.enabled
true
Indicates whether to reuse containers. true: yes; false: no.
tez.local.mode
false
Indicates whether to enable the local mode. true: yes; false: no.
tez.runtime.optimize.local.fetch
true
Indicates whether to optimize the file logic in local mode. true: yes; false: no.
hive.vectorized.execution.enabled
true
Indicates whether to enable Hive vectorization. true: yes; false: no.
mapreduce.map.java.opts
-Xmx200m
Sets the maximum heap memory that can be used by the Map task.
mapreduce.reduce.java.opts
-Xmx200m
Sets the maximum heap memory that can be used by the Reduce task.
tez.container.max.java.heap.fraction
0.8
Sets the ratio of the heap memory available to the Tez main task process to the container memory.
omni.hive.string.length
-
Sets the maximum String data length supported by HiveExtension.
omni.hive.adaptivePartialAggregation.enabled
false
Enables adaptive skipping of the HashAgg group aggregation operation in the partial stage. This optimization is performed during running. The partial stage of group aggregation is skipped and data is directly output to the downstream operator if the sampling scenario is identified as a high cardinality scenario and group aggregation is performed.
omni.hive.adaptivePartialAggregationMinRows
500000
Specifies the minimum number of rows sampled for adaptivePartialAggregation optimization. When this number has been reached, the tool calculates the aggregation of the sampled data.
omni.hive.adaptivePartialAggregationRatio
0.8
Specifies the minimum aggregation threshold for adaptivePartialAggregation optimization. If the aggregation of sampled data has reached the threshold, this type of optimization is applied.
- Commands for starting open source Hive SQL:
- Check whether HiveExtension takes effect.
Run the following SQL statement in the HiveExtension CLI and open source Hive SQL CLI:
explain select i_item_id ,i_item_desc ,i_current_price from item, inventory, date_dim, store_sales where i_current_price between 76 and 76+30 and inv_item_sk = i_item_sk and d_date_sk=inv_date_sk and d_date between cast('1998-06-29' as date) and cast('1998-08-29' as date) and i_manufact_id in (512,409,677,16) and inv_quantity_on_hand between 100 and 500 and ss_item_sk = i_item_sk group by i_item_id,i_item_desc,i_current_price order by i_item_id limit 100;The following figure shows the execution plan output in the HiveExtension CLI. If the operator name starts with "Omni", HiveExtension has taken effect.

Execution plan outputted by open source Hive SQL

- Run the following SQL statement.
Run the following SQL statement in the HiveExtension CLI and open source Hive SQL CLI:
select i_item_id ,i_item_desc ,i_current_price from item, inventory, date_dim, store_sales where i_current_price between 76 and 76+30 and inv_item_sk = i_item_sk and d_date_sk=inv_date_sk and d_date between cast('1998-06-29' as date) and cast('1998-08-29' as date) and i_manufact_id in (512,409,677,16) and inv_quantity_on_hand between 100 and 500 and ss_item_sk = i_item_sk group by i_item_id,i_item_desc,i_current_price order by i_item_id limit 100; - Compare the results.



