Component Principles
Flink
Flink is a unified computing framework that supports both batch and stream processing. It provides a stream data processing engine that supports data distribution and parallel computing. Flink focuses on stream processing and is one of the leading open-source stream processing engines in the industry. Figure 1 shows the component architecture of Flink. Flink provides high-concurrency pipelined data processing, millisecond-level latency, and high reliability, making it suitable for latency-sensitive data processing.
- Specially designed for stream processing, supporting millisecond-level latency.
- Runs on Yarn to provide more flexible and comprehensive resource isolation, which is applicable to multi-tenant scenarios.
- Supports the asynchronous snapshot mechanism for backing up user job status and supports restoration of user jobs in certain status so that each event is processed exactly once.
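The snapshot-and-restore idea behind the exactly-once guarantee above can be illustrated with a minimal, self-contained sketch (plain Python, no Flink dependency; all class and method names here are hypothetical): a stateful job periodically checkpoints its state together with its input offset, and after a failure it restores the last checkpoint and replays events from that offset, so each event affects the state exactly once.

```python
import copy

class CheckpointedCounter:
    """Toy stateful job that checkpoints state plus input offset.

    Illustrates the idea behind Flink's snapshot/restore mechanism only;
    this is not the Flink API.
    """

    def __init__(self):
        self.count = 0          # job state
        self.offset = 0         # position in the input stream
        self._snapshot = None   # last completed checkpoint

    def process(self, event):
        self.count += event
        self.offset += 1

    def checkpoint(self):
        # Snapshot state and offset together, as one consistent unit.
        self._snapshot = copy.deepcopy((self.count, self.offset))

    def restore(self):
        self.count, self.offset = self._snapshot


events = [1, 2, 3, 4, 5]
job = CheckpointedCounter()

# Process two events, then take a checkpoint.
for e in events[:2]:
    job.process(e)
job.checkpoint()

# Process one more event, then simulate a crash and restore.
job.process(events[2])
job.restore()

# Replay from the checkpointed offset: event 3 is re-applied exactly once.
for e in events[job.offset:]:
    job.process(e)

print(job.count)  # 15: every event counted exactly once despite the failure
```

Because the state and the offset are checkpointed atomically, replay after restore neither skips nor double-counts the event that was in flight when the failure occurred.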
Storm
Storm is a distributed, reliable, and fault-tolerant real-time stream data processing system. Figure 2 shows the component architecture of Storm. In Storm, a graph-shaped data structure called a topology must first be designed for real-time computing. The topology is submitted to a cluster, where a master node distributes code and assigns tasks to worker nodes. A topology contains two types of components: spouts and bolts. A spout emits data streams as tuples. A bolt transforms those streams and performs computing and filtering operations; a bolt can also emit data to other bolts. Tuples emitted by a spout are immutable arrays that map to fixed key-value pairs.
Service processing logic is encapsulated in a Storm topology. A topology is a set of spout (data source) and bolt (logical processing) components connected by stream groupings into a directed acyclic graph (DAG). All components (spouts and bolts) in a topology run in parallel. You can specify the degree of parallelism for each node in a topology, and Storm then allocates tasks across the cluster accordingly to improve processing capability.
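The spout-to-bolt data flow described above can be sketched as a small pure-Python simulation (no Storm dependency; all class names are hypothetical): a spout emits immutable tuples, one bolt transforms them, and a second bolt filters them, forming a tiny DAG.

```python
class NumberSpout:
    """Emits immutable tuples, as a Storm spout would (illustrative only)."""
    def emit(self):
        for n in range(10):
            yield (n,)  # Python tuples are immutable, like Storm tuples

class SquareBolt:
    """Transforms each tuple (computing step)."""
    def process(self, tup):
        (n,) = tup
        return (n, n * n)

class EvenFilterBolt:
    """Keeps only tuples whose square is even (filtering step)."""
    def process(self, tup):
        n, sq = tup
        return tup if sq % 2 == 0 else None

# Wire the components into a tiny DAG: spout -> SquareBolt -> EvenFilterBolt.
spout, square, even = NumberSpout(), SquareBolt(), EvenFilterBolt()
results = []
for tup in spout.emit():
    out = even.process(square.process(tup))
    if out is not None:
        results.append(out)

print(results)  # [(0, 0), (2, 4), (4, 16), (6, 36), (8, 64)]
```

In a real Storm cluster each component would run as many parallel tasks distributed across worker nodes, with stream groupings deciding which task receives each tuple; the sketch collapses that into a single sequential pipeline.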
Kafka
Kafka is a distributed, high-throughput message system based on the publish-subscribe model. Kafka can be used to build a large-scale message system that caches messages for real-time stream processing; its most common role is caching real-time data sources. Figure 3 shows the component architecture of Kafka.
- Multiple partitions can be configured for a topic to increase concurrent throughput.
- Supports a data replica mechanism, with a configurable retention period, to ensure data reliability.
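How multiple partitions enable concurrent throughput can be illustrated with a small pure-Python model (no Kafka dependency; the class and parameter names are hypothetical): messages with the same key hash to the same partition, so per-key ordering is preserved, while independent partitions can be consumed in parallel.

```python
class Topic:
    """Toy topic with N partitions, mimicking Kafka's key-based partitioning."""
    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        # The same key always lands in the same partition,
        # preserving per-key message order.
        idx = hash(key) % len(self.partitions)
        self.partitions[idx].append((key, value))
        return idx

topic = Topic(num_partitions=3)
for i in range(9):
    topic.produce(key=f"user-{i % 4}", value=f"event-{i}")

# Each partition can be consumed by a separate consumer in parallel.
for i, p in enumerate(topic.partitions):
    print(f"partition {i}: {p}")
```

Real Kafka additionally replicates each partition across brokers and expires messages after the configured retention period; the sketch shows only the partitioning aspect.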
Redis
Redis is a high-performance, in-memory key-value database. It is well suited to serve as a cache or message queue in a system. Figure 4 shows the component architecture of Redis.
- Delivers high performance, supporting over 100,000 operations per second.
- Supports various data types (String, Hash, List, Set, and Sorted Set).
- Supports transactions: multiple operations can be combined and executed as a single atomic unit.
- Enables persistency through snapshots (full) and logs (incremental).
- Supports primary-secondary replication and synchronization, which applies to scenarios such as read/write separation, data backup, and disaster recovery (DR).
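The transaction behavior listed above (queue multiple operations, then execute them as one atomic unit) can be sketched with a toy in-memory store modeled on Redis's MULTI/EXEC semantics (pure Python, no Redis server; all names are hypothetical):

```python
class MiniStore:
    """Toy in-memory key-value store with MULTI/EXEC-style transactions.

    Models the idea behind Redis transactions only; not the Redis protocol.
    """
    def __init__(self):
        self.data = {}
        self._queue = None  # non-None while a transaction is open

    def set(self, key, value):
        if self._queue is not None:          # inside MULTI: queue the command
            self._queue.append((key, value))
            return "QUEUED"
        self.data[key] = value               # outside MULTI: apply immediately
        return "OK"

    def multi(self):
        self._queue = []

    def exec(self):
        # Apply all queued commands as one unit; nothing else interleaves.
        queued, self._queue = self._queue, None
        for key, value in queued:
            self.data[key] = value
        return len(queued)

store = MiniStore()
store.multi()
store.set("balance:alice", 90)   # queued, not yet applied
store.set("balance:bob", 110)    # queued, not yet applied
applied = store.exec()           # both writes take effect together
print(applied, store.data)
```

In real Redis, commands queued between MULTI and EXEC are executed sequentially without interleaving from other clients, which is what makes the grouped operations behave atomically.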