
Architecture

Kunpeng BoostKit for Big Data supports multiple big data platforms and application scenarios such as offline analysis, real-time search, and real-time stream processing.

Real-time stream processing generally refers to analyzing data rapidly as it arrives in order to trigger follow-up actions. It places high demands on processing speed and, because the data volume is large, on CPU and memory as well. By contrast, it requires little storage capacity because in most cases the data does not need to be stored. Real-time processing is generally implemented as Storm, Spark Streaming, or Flink tasks. Its typical features are as follows:

  • Strict processing latency requirements (millisecond level)
  • Large data volume (hundreds of megabytes per second)
  • Heavy use of compute resources
  • Prone to compute resource preemption
  • Data mainly in network protocol formats
  • Relatively simple tasks
  • Data isolated from clients; small storage capacity required

The distributed messaging system Kafka delivers collected data in real time to a distributed stream computing engine (Flink, Storm, or Spark Streaming) for processing. Redis stores the results and serves as a cache for upper-layer services. Figure 1 shows the detailed system architecture.

Figure 1 Real-time big data stream processing architecture
Table 1 Nodes in big data real-time stream processing scenarios

| Name | Description |
| --- | --- |
| Data source | Includes real-time stream data (such as socket streams, OGG log streams, and log files), real-time files, and databases. |
| Real-time data collection system | Flume: collection tool from the Hadoop ecosystem that supports data sources in various formats, including log files and network data streams. Third-party collection tools: dedicated real-time data collection tools, including GoldenGate (real-time database collection) and self-developed collection programs (customized collection tools). |
| Message middleware | Caches real-time data and supports high-throughput message publishing and subscription. Kafka: distributed messaging system that supports message production and publishing, and message caching in various forms, meeting the requirements of efficient and reliable message production and consumption. |
| Distributed stream computing engine | Quickly analyzes real-time data. Storm: open source distributed real-time compute engine that reliably processes unbounded data streams. Flink: next-generation stream processing engine that supports millisecond-level stream processing. |
| Data cache | Caches stream processing results to meet the access requirements of stream processing applications. Redis: provides high-speed key-value storage and query capabilities for rapidly caching stream processing results. |
| Service applications | Applications developed by independent software vendors (ISVs) for querying and using real-time stream processing results. |
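
The data flow through the nodes in Table 1 (collection, message middleware, stream computing, data cache, service application) can be illustrated with a minimal sketch. It uses only the Python standard library and is not a BoostKit, Kafka, or Redis API: the queue `topic` stands in for a Kafka topic, the dictionary `cache` stands in for Redis, and the functions `collect`, `process_stream`, and `query`, as well as the record format, are hypothetical names chosen for illustration.

```python
import queue
from collections import Counter

# Hypothetical in-memory stand-ins (assumptions for illustration only).
# A real deployment would use Kafka, a stream engine such as Flink, Storm,
# or Spark Streaming, and Redis.
topic = queue.Queue()   # stands in for a Kafka topic
cache = {}              # stands in for a Redis key-value store


def collect(lines):
    """Collection layer: publish each raw record to the message queue."""
    for line in lines:
        topic.put(line)


def process_stream(batch_size=3):
    """Stream engine: consume records in small batches and update counts."""
    counts = Counter()
    while not topic.empty():
        batch = []
        while len(batch) < batch_size and not topic.empty():
            batch.append(topic.get())
        # Each record is "<source> <payload>"; count records per source.
        counts.update(record.split(" ", 1)[0] for record in batch)
        # Write the running result to the cache after every batch.
        for source, n in counts.items():
            cache[f"count:{source}"] = n


def query(source):
    """Service application: read the latest result from the cache."""
    return cache.get(f"count:{source}", 0)


collect(["web GET /a", "app login", "web GET /b", "web POST /c"])
process_stream()
print(query("web"), query("app"), query("db"))  # prints "3 1 0"
```

The sketch keeps every stage behind its own function so that each one maps to a single row of Table 1. The service application reads only from the cache and never from the queue, which mirrors the role of Redis as the access layer for upper-layer services.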