Hive Overview

Hive Workflow

Hive is a Hadoop-based data warehouse tool. It maps structured data files to tables and provides SQL-like query capabilities.

Figure 1 Hive workflow

The workflow is as follows:

A user submits a task such as a query to the driver.
The driver passes the task to the compiler to obtain an execution plan.
The compiler obtains required Hive metadata from MetaStore based on the user task.
After obtaining the metadata, the compiler compiles the task. It converts HiveQL into an abstract syntax tree, the abstract syntax tree into a query block, and the query block into a logical query plan. It then rewrites the logical query plan, converts it into a physical plan (for the Tez engine), and finally selects the optimal policy.
The compiler submits the final plan to the driver.
The driver passes the plan to ExecutionEngine for execution. ExecutionEngine obtains the metadata information and submits the task to JobTracker or ResourceManager for execution. The task reads files in HDFS directly and performs the corresponding operations.
ExecutionEngine obtains the execution results.
The driver obtains the execution results and returns them to the user.
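
The control flow of the steps above can be sketched in a few lines of Python. This is a toy model only: the classes MetaStore, Compiler, ExecutionEngine, and Driver, their methods, and the orders table are hypothetical names chosen for illustration and are not Hive APIs. The compile stages are recorded as labels rather than actually performed.

```python
# Toy model of the workflow described above.
# Every class, method, and table name here is illustrative, not a Hive API.


class MetaStore:
    """Holds table metadata (here, only the column names of each table)."""

    def __init__(self, tables):
        self._tables = tables

    def get_columns(self, table):
        return self._tables[table]


class Compiler:
    """Builds a plan for a task, consulting the MetaStore for metadata."""

    # Stage labels follow the conversion chain described in the text.
    STAGES = (
        "abstract syntax tree",
        "query block",
        "logical query plan",
        "rewritten logical query plan",
        "physical plan",
    )

    def __init__(self, metastore):
        self._metastore = metastore

    def compile(self, task):
        # Fetch the metadata needed for this task, then build the plan.
        columns = self._metastore.get_columns(task["table"])
        return {
            "table": task["table"],
            "columns": columns,
            "stages": list(self.STAGES),
        }


class ExecutionEngine:
    """Stands in for the engine that runs the final plan."""

    def run(self, plan):
        return "scanned {} ({})".format(plan["table"], ", ".join(plan["columns"]))


class Driver:
    """Receives the user task, asks for a plan, and hands it to the engine."""

    def __init__(self, compiler, engine):
        self._compiler = compiler
        self._engine = engine

    def submit(self, task):
        plan = self._compiler.compile(task)  # compiler returns the final plan
        return self._engine.run(plan)        # results flow back to the caller


def run_example():
    metastore = MetaStore({"orders": ["id", "amount"]})
    driver = Driver(Compiler(metastore), ExecutionEngine())
    return driver.submit({"table": "orders"})
```

Calling run_example() walks one task through the driver, compiler, MetaStore, and execution engine in the same order as the list above.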

As the preceding process shows, Hive execution is affected by the following two factors:

Task compilation by the compiler: This directly determines the query plan, and different query plans result in different physical plans (Tez).
Tez engine: This is the component that actually executes Hive tasks.

Tez Engine

Tez is a directed acyclic graph (DAG) computing framework based on Hadoop Yarn. Building on the MapReduce framework, Tez decomposes the map and reduce phases into finer-grained logical operations: it splits the map phase into input, process, sort, merge, and output operations, and the reduce phase into input, shuffle, sort, merge, process, and output operations. These modular operations can be flexibly combined to form new workflows, and control programs orchestrate them into a DAG. The Tez computing framework then generates a simple DAG job from the result. Operators do not exit after they finish running; the operators used in one round are reused in the next round. This greatly reduces disk I/O operations and improves the computing speed.
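
To illustrate how operations arranged as a DAG can be put into a valid execution order, the following minimal sketch uses the graphlib module from the Python standard library (Python 3.9 or later). The vertex names and the shape of the graph are assumptions made for illustration and are not taken from Tez. A real scheduler can also run independent vertices in parallel instead of one after another.

```python
from graphlib import TopologicalSorter

# Hypothetical DAG for a join followed by an aggregation.
# Each key is a vertex; its value is the set of vertices whose output it consumes.
DAG = {
    "scan_orders": set(),
    "scan_customers": set(),
    "join": {"scan_orders", "scan_customers"},
    "aggregate": {"join"},
    "write_result": {"aggregate"},
}


def execution_order(dag):
    """Return the vertices in an order in which every input precedes its consumer."""
    return list(TopologicalSorter(dag).static_order())


def respects_edges(order, dag):
    """Check that each vertex appears after all of the vertices it depends on."""
    position = {vertex: index for index, vertex in enumerate(order)}
    return all(
        position[upstream] < position[vertex]
        for vertex, upstreams in dag.items()
        for upstream in upstreams
    )
```

Here execution_order(DAG) places the two scan vertices first (in either order), followed by join, aggregate, and write_result. Passing a graph that contains a cycle raises graphlib.CycleError, which is why the graph must be acyclic.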

To sum up, Tez has the following features:

Tez runs on Yarn.
Tez is compatible with MapReduce and inherits all advantages of MapReduce (such as good scalability and fault tolerance).
Tez applies to DAG application scenarios.

At the bottom layer, Tez provides DAG programming interfaces that users can call to write programs. The interfaces consist of the data processing engine and DAGAppMaster. The data processing engine provides a set of programming interfaces and data calculation operators. DAGAppMaster is a Yarn ApplicationMaster that enables Tez applications to run on Yarn.

For example, Hive SQL statements that would be translated into four MapReduce jobs can instead be run by Tez as a DAG job, which greatly reduces disk I/O.
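
The saving can be made concrete with a deliberately simplified model. The sketch below assumes that chained MapReduce jobs hand results to each other through HDFS, while the stages of one DAG job pass data along without that round trip. The function names and stage labels are hypothetical, and the model ignores the initial input read, the final output write, and any local shuffle spills.

```python
# Simplified I/O model; stage labels and function names are illustrative only.
STAGES = ["stage_1", "stage_2", "stage_3", "stage_4"]


def run_as_mapreduce_chain(stages):
    """List the events when each stage runs as a separate MapReduce job."""
    events = []
    for index, stage in enumerate(stages):
        if index > 0:
            # Read the previous job's output back from HDFS.
            events.append(("hdfs_read", stages[index - 1]))
        events.append(("run", stage))
        if index < len(stages) - 1:
            # Persist the intermediate result to HDFS for the next job.
            events.append(("hdfs_write", stage))
    return events


def run_as_single_dag(stages):
    """List the events when all stages run inside one DAG job (no HDFS handoff)."""
    return [("run", stage) for stage in stages]


def count_events(events, kind):
    """Count the events of a given kind, for example 'hdfs_write'."""
    return sum(1 for event_kind, _ in events if event_kind == kind)
```

Under these assumptions, four chained jobs produce three intermediate HDFS writes and three matching reads, while the single DAG job produces none.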
