Architecture
Kunpeng BoostKit for Big Data supports multiple big data platforms and application scenarios such as offline analysis, real-time search, and real-time stream processing.
Offline analysis generally means that the massive volumes of data generated on the previous day are analyzed and processed offline, and the results are made available for later use. Offline processing has relaxed requirements on processing time, but the large volume of data to be processed occupies substantial compute and storage resources. It is typically implemented with MapReduce, Spark, or SQL jobs. The typical features are as follows:
- Low requirements on the processing time
- Up to petabytes of data to be processed
- Various data formats
- Complex scheduling of jobs
- A large number of compute and storage resources occupied
- Support for SQL jobs and custom jobs
- Prone to resource preemption
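The MapReduce paradigm that many of these offline jobs rely on can be sketched in plain Python (the input records are hypothetical stand-ins; a real job would run on a Hadoop or Spark cluster over HDFS):

```python
from collections import defaultdict

# Map phase: emit (word, 1) pairs for each input record.
def map_phase(records):
    for record in records:
        for word in record.split():
            yield word, 1

# Shuffle phase: group intermediate values by key.
def shuffle_phase(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: aggregate the grouped values for each key.
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical input records standing in for files stored on HDFS.
records = ["error warn info", "error info", "info"]
counts = reduce_phase(shuffle_phase(map_phase(records)))
print(counts)  # {'error': 2, 'warn': 1, 'info': 3}
```

In a real deployment the map and reduce phases run as distributed tasks scheduled by the cluster, and the shuffle moves intermediate data between nodes; the per-phase structure is the same.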
The offline analysis system uses the Hadoop Distributed File System (HDFS) as the storage foundation. The compute engines are mainly MapReduce, Hive, and Spark SQL. Figure 1 shows the detailed system architecture.
| Name | Description |
|---|---|
| Data source | Includes stream data (such as Socket streams, OGG log streams, and log files), batch file data, and databases. |
| Real-time data collection system | |
| Batch collection system | |
| Offline batch processing engines | |
| Service applications | Service applications developed by independent software vendors (ISVs) for querying and using batch processing results. |

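The end-to-end flow through these components can be sketched in plain Python (all function names and records below are hypothetical stand-ins; a real system would use distributed collection tools and the batch engines named above rather than in-memory functions):

```python
# Hypothetical stand-ins for the components in the table above; a real
# deployment stores data on HDFS and processes it with MapReduce, Hive,
# or Spark SQL instead of in-memory Python functions.

def batch_collect(data_source):
    """Batch collection system: pull raw records from the data source."""
    return list(data_source)

def batch_process(records):
    """Offline batch processing engine: aggregate records per key."""
    totals = {}
    for key, value in records:
        totals[key] = totals.get(key, 0) + value
    return totals

def service_query(results, key):
    """Service application: query the batch processing results."""
    return results.get(key, 0)

# Hypothetical database rows acting as the data source.
source = [("orders", 3), ("clicks", 5), ("orders", 2)]
results = batch_process(batch_collect(source))
print(service_query(results, "orders"))  # 5
```

The sketch mirrors the table's layering: collection decouples the data source from the engine, and service applications read only the precomputed results, which is why relaxed latency is acceptable for the processing stage.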