Architecture

Kunpeng BoostKit for Big Data supports multiple big data platforms and application scenarios such as offline analysis, real-time search, and real-time stream processing.

Real-time stream processing generally refers to analyzing data rapidly as it arrives in order to trigger follow-up actions. It places high demands on processing speed and, because of the large data volume, on CPU and memory. By comparison, little storage capacity is required because in most cases the data does not need to be persisted. Real-time processing is generally implemented through Storm, Spark Streaming, or Flink tasks. Its typical features are as follows:

Strict processing latency requirements (millisecond level)
Massive data volume to be processed (hundreds of megabytes per second)
Heavy consumption of compute resources
Prone to compute resource preemption
Data mainly in network protocol formats
Relatively simple tasks
Isolation of data from clients, small storage capacity

The distributed message system Kafka sends collected data in real time to a distributed stream computing engine (Flink, Storm, or Spark Streaming) for processing. Redis stores the results and serves as a cache for upper-layer services. Figure 1 shows the detailed system architecture.

Figure 1 Real-time big data stream processing architecture

**Table 1** Nodes in big data real-time stream processing scenarios
Name	Description
Data source	Includes real-time stream data (such as Socket streams, OGG log streams, and log files), real-time files, and databases.
Real-time data collection system	Flume: collection tool from the Apache Hadoop ecosystem. It supports data sources in various formats, including log files and network data streams. Third-party collection tools: dedicated third-party real-time data collection tools, including GoldenGate (real-time database collection) and self-developed collection programs (customized collection tools).
Message middleware	Caches real-time data and supports high-throughput message publishing and subscription. Kafka: distributed message system. It supports message production and publishing, and message caching in various forms, meeting the requirements of efficient and reliable message production and consumption.
Distributed stream computing engine	Quickly analyzes real-time data. Storm: open source distributed real-time compute engine. Storm can be used to reliably process unbounded data streams. Flink: next-generation stream processing engine, which supports millisecond-level stream processing.
Data cache	Caches stream processing analysis results to meet the access requirements of stream processing applications. Redis: supports high-speed key-value storage and query capabilities for rapidly caching stream processing results.
Service applications	Service applications developed by ISVs for querying and using real-time stream processing results.
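
The data flow through the nodes in Table 1 can be sketched in a few lines of Python. This is a minimal in-memory illustration, not an actual deployment: a `queue.Queue` stands in for a Kafka topic, a plain `dict` stands in for Redis, and the function names (`collect`, `process`, `store`, `query`), the `clicks:` key prefix, and the JSON event format are all hypothetical. A real stream computing engine would consume messages continuously rather than draining the queue once.

```python
import json
import queue
from collections import defaultdict


def collect(topic, raw_lines):
    """Collection stage: publish each raw record to the message queue
    (stand-in for a collector such as Flume writing to Kafka)."""
    for line in raw_lines:
        topic.put(line)


def process(topic):
    """Stream computing stage: drain the messages currently queued and
    count events per user (stand-in for a Storm, Spark Streaming, or Flink task)."""
    counts = defaultdict(int)
    while True:
        try:
            line = topic.get_nowait()
        except queue.Empty:
            break
        event = json.loads(line)
        counts[event["user"]] += 1
    return dict(counts)


def store(cache, counts):
    """Cache stage: merge the computed results into the key-value cache
    (stand-in for Redis)."""
    for user, n in counts.items():
        key = f"clicks:{user}"
        cache[key] = cache.get(key, 0) + n


def query(cache, user):
    """Service application: read a cached result, defaulting to 0."""
    return cache.get(f"clicks:{user}", 0)


if __name__ == "__main__":
    topic = queue.Queue()  # hypothetical stand-in for a Kafka topic
    cache = {}             # hypothetical stand-in for Redis
    collect(topic, ['{"user": "alice"}', '{"user": "bob"}', '{"user": "alice"}'])
    store(cache, process(topic))
    print(query(cache, "alice"))
```

Only the computed results reach the cache; the raw records are discarded once processed, which matches the small storage footprint described for this scenario.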
