Component Principles
Hive
The Hive engine converts SQL jobs submitted by users into MapReduce jobs, which access data on HDFS under the scheduling of Yarn. To external clients, the entire system is presented as an SQL database. Figure 1 shows the component architecture of Hive. A minimal submission sketch follows the feature list below.
- Hive uses Yarn as its resource scheduling system. Resources can be allocated by proportion or by absolute value, and can also be isolated at the level of physical nodes.
- Supports linear scaling by adding nodes, with low hardware requirements.
- Supports multiple data formats such as TXT, SequenceFile, ORC, and Parquet, as well as data compression and encryption.
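The sketch below shows, at a high level, how a client could submit such an SQL job through the HiveServer2 JDBC interface. The host name, port, database, credentials, table name, and columns are hypothetical placeholders rather than values from this document; only the standard java.sql API and the Hive JDBC driver class are assumed.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveSqlJob {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver (assumed to be on the classpath).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // HiveServer2 endpoint; host, port, database, and credentials are placeholders.
        String url = "jdbc:hive2://hiveserver2-host:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hive_user", "");
             Statement stmt = conn.createStatement()) {

            // DDL: the table data is stored on HDFS in ORC format.
            stmt.execute("CREATE TABLE IF NOT EXISTS web_logs (ip STRING, url STRING, ts BIGINT) "
                       + "STORED AS ORC");

            // Aggregation query: when MapReduce is the configured execution engine,
            // Hive compiles it into MapReduce jobs that Yarn schedules across the cluster.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT url, COUNT(*) AS hits FROM web_logs GROUP BY url")) {
                while (rs.next()) {
                    System.out.println(rs.getString("url") + "\t" + rs.getLong("hits"));
                }
            }
        }
    }
}
```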
Spark
The Spark SQL engine converts SQL jobs submitted by users into Spark jobs, which access data on HDFS under the scheduling of Yarn. To external clients, the entire system is presented as an SQL database. Figure 2 shows the component architecture of Spark.
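As a rough illustration, the sketch below runs the same kind of aggregation through the Spark SQL engine. The application name, HDFS path, view name, and columns are hypothetical placeholders, and the job is assumed to be launched on Yarn (for example via spark-submit --master yarn).

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlJob {
    public static void main(String[] args) {
        // When submitted to Yarn, Yarn schedules the executors that read the HDFS data.
        SparkSession spark = SparkSession.builder()
                .appName("SparkSqlJob")
                .getOrCreate();

        // Path and schema are placeholders; the Parquet files are assumed to live on HDFS.
        Dataset<Row> logs = spark.read().parquet("hdfs:///data/web_logs");
        logs.createOrReplaceTempView("web_logs");

        // The SQL text is converted into a Spark job (a DAG of stages) rather than MapReduce.
        Dataset<Row> hits = spark.sql(
                "SELECT url, COUNT(*) AS hits FROM web_logs GROUP BY url ORDER BY hits DESC");
        hits.show(20);

        spark.stop();
    }
}
```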
Spark and MapReduce are the two basic distributed computing frameworks in Hadoop. Both are used to build non-SQL batch processing jobs, such as complex data mining and machine learning. The key difference is that Spark iterates on data mainly in memory, whereas MapReduce writes intermediate results to HDFS. Compared with MapReduce, Spark has the following features:
- In-memory iteration is fast: Spark typically runs 5 to 10 times faster than MapReduce.
- Spark provides built-in functions and algorithm libraries for data mining and statistical analysis, including MLlib and Mahout, among others (a minimal MLlib sketch follows this list).
- It has high hardware requirements, especially for memory capacity.
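The following sketch illustrates the algorithm-library point above with a small KMeans clustering run using Spark's org.apache.spark.ml API. The tiny in-memory dataset and the parameter values (k = 2, fixed seed) are made up purely for illustration; in practice the feature vectors would normally be loaded from HDFS.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.clustering.KMeans;
import org.apache.spark.ml.clustering.KMeansModel;
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.ml.linalg.VectorUDT;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class MllibKMeansExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("MllibKMeansExample").getOrCreate();

        // A tiny in-memory dataset stands in for feature vectors normally read from HDFS.
        List<Row> points = Arrays.asList(
                RowFactory.create(Vectors.dense(0.0, 0.0)),
                RowFactory.create(Vectors.dense(0.1, 0.1)),
                RowFactory.create(Vectors.dense(9.0, 9.0)),
                RowFactory.create(Vectors.dense(9.1, 9.1)));
        StructType schema = new StructType(new StructField[]{
                new StructField("features", new VectorUDT(), false, Metadata.empty())});
        Dataset<Row> data = spark.createDataFrame(points, schema);

        // MLlib's KMeans runs as ordinary Spark jobs, iterating over the data in memory.
        KMeansModel model = new KMeans().setK(2).setSeed(1L).fit(data);
        for (Vector center : model.clusterCenters()) {
            System.out.println("Cluster center: " + center);
        }

        spark.stop();
    }
}
```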

