
Architecture

Kunpeng BoostKit for Big Data supports multiple big data platforms and application scenarios such as offline analysis, real-time search, and real-time stream processing.

Offline analysis typically means that the massive volumes of data generated on the previous day are analyzed and processed offline, and the results are made available for later use. Offline processing has loose requirements on processing time, but it must handle large amounts of data and therefore occupies many compute and storage resources. It is generally implemented with MapReduce, Spark, or SQL jobs. The typical features are as follows:

  • Low requirements on the processing time
  • Up to petabytes of data to be processed
  • Various data formats
  • Complex scheduling of jobs
  • A large number of compute and storage resources occupied
  • Support for SQL jobs and custom jobs
  • Prone to resource preemption
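
The MapReduce model that underlies most of these offline batch jobs can be illustrated with a minimal, framework-free sketch in plain Python (the sample records and function names are illustrative, not part of any Hadoop or Kunpeng API):

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit (word, 1) pairs from each input record."""
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each key's values into a final count."""
    return {key: sum(values) for key, values in groups.items()}

records = ["spark hive spark", "hive hdfs"]
counts = reduce_phase(shuffle(map_phase(records)))
print(counts)  # {'spark': 2, 'hive': 2, 'hdfs': 1}
```

In a real deployment the map and reduce functions run in parallel across the cluster and the shuffle moves data over the network, which is why such jobs occupy many compute and storage resources.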

The offline analysis system uses the Hadoop Distributed File System (HDFS) as its storage foundation. The compute engines are mainly MapReduce, Hive, and Spark SQL. Figure 1 shows the detailed system architecture.

Figure 1 Big data offline computing architecture
Table 1 Nodes in big data offline scenarios

Data source: stream data (such as Socket streams, OGG log streams, and log files), batch file data, and databases.

Real-time data collection system:

  • Flume: collects data such as Socket streams and log files.
  • Third-party collection tools: third-party or customized data collection tools or programs. A common pattern is to collect data, send it to Kafka for buffering, preprocess it with Spark Streaming, and load the results in real time.
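
The collect-buffer-preprocess pattern described above can be sketched without Kafka or Spark Streaming by using an in-memory queue as a stand-in for the message broker (all names and sample records here are illustrative):

```python
import queue

# Stand-in for a Kafka topic: collectors append raw events,
# the stream processor drains them in micro-batches.
broker = queue.Queue()

# Collection step: a Flume-like agent pushes raw log lines.
for line in ["GET /a 200", "GET /b 500", "GET /c 200"]:
    broker.put(line)

def preprocess(batch):
    """Spark-Streaming-like step: filter and parse one micro-batch."""
    return [line.split() for line in batch if line.endswith("200")]

# Drain one micro-batch and preprocess it before loading.
batch = []
while not broker.empty():
    batch.append(broker.get())
loaded = preprocess(batch)
print(loaded)  # [['GET', '/a', '200'], ['GET', '/c', '200']]
```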

Batch collection system:

  • Flume: collects data files and log files in batches.
  • Sqoop: collects database data in batches.
  • Third-party collection/extract, transform, load (ETL) tools: third-party tools for data collection, loading, and processing.

Offline batch processing engines:

  • Hive: traditional SQL batch processing engine used to run SQL batch jobs. Its performance is stable on large data volumes, but its processing speed is slow.
  • MapReduce: traditional batch processing engine used to run non-SQL batch jobs, especially data mining and machine learning workloads. It is widely used and performs stably on large data volumes, but its processing speed is slow.
  • Spark SQL: newer SQL batch processing engine used to run SQL batch jobs. It is suitable for massive data volumes and its processing speed is fast.
  • Spark: newer batch processing engine used to run non-SQL batch jobs, especially data mining and machine learning workloads. It is suitable for massive data volumes and its processing speed is fast.
  • Yarn: resource scheduling engine that provides resource scheduling for the batch processing engines and is the basis of multi-tenant resource allocation.
  • HDFS: distributed file system that provides data storage for the batch processing engines and can store data in various file formats.
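
Whether the engine is Hive or Spark SQL, an offline SQL batch job boils down to scanning a large table and aggregating it. The shape of such a job can be sketched with Python's built-in sqlite3 module as a lightweight stand-in for the SQL engine (the table and column names are made up for illustration):

```python
import sqlite3

# In-memory database as a stand-in for a Hive/Spark SQL warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (page TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("home", 120), ("docs", 45), ("home", 80), ("docs", 5)],
)

# A typical batch aggregation: total views per page, largest first.
rows = conn.execute(
    "SELECT page, SUM(views) FROM page_views GROUP BY page ORDER BY 2 DESC"
).fetchall()
print(rows)  # [('home', 200), ('docs', 50)]
```

On Hive or Spark SQL the same statement would be compiled into distributed MapReduce or Spark tasks over HDFS data, which is what distinguishes the slower traditional engine from the faster in-memory one.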

Service applications: applications developed by independent software vendors (ISVs) to query and use the batch processing results.