Kafka Overview

Kafka is a distributed, partitioned, replicated publish-subscribe messaging system. Its features are similar to those of the Java Message Service (JMS), but its design is completely different. Kafka provides message persistence, high throughput, distributed operation, multi-client support, and real-time processing. It suits both online and offline message consumption, such as regular message collection, website activity tracking, aggregation of statistical system operation data (monitoring data), and log collection. These scenarios involve collecting large volumes of data for Internet services.

The Kafka test process is as follows:

1. Create a Kafka topic.
2. Randomly generate data and write it to the topic.
3. Read the data from the topic and consume it.
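
The three steps above can be sketched without a running cluster by using a toy in-memory model. The class ToyBroker, its methods, and the topic name "test-topic" are hypothetical names chosen for illustration; they are not the Kafka client API, and a real test would use a Kafka producer and consumer against a broker.

```python
import random


class ToyBroker:
    """Hypothetical in-memory stand-in for a Kafka cluster (not a real client API)."""

    def __init__(self):
        self.topics = {}

    def create_topic(self, name):
        # Step 1: create the topic; a plain list stands in for its log.
        self.topics[name] = []

    def produce(self, name, value):
        # Step 2: append one record to the end of the topic.
        self.topics[name].append(value)

    def consume(self, name, start=0):
        # Step 3: read every record from position `start` onward.
        return list(self.topics[name][start:])


def run_test(num_records=5, seed=0):
    """Run the create / write / read cycle and return what was sent and received."""
    rng = random.Random(seed)
    broker = ToyBroker()
    broker.create_topic("test-topic")
    sent = [rng.randint(0, 999) for _ in range(num_records)]
    for value in sent:
        broker.produce("test-topic", value)
    received = broker.consume("test-topic")
    return sent, received
```

In this model the test passes when the received records match the sent records in both content and order.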

Figure 1 Test process

A topic is a category or feed name to which records are published. Topics in Kafka are always multi-subscriber. That is, a topic can have zero, one, or many consumers that subscribe to the data written to it.

For each topic, the Kafka cluster maintains a partitioned log, as shown in Figure 2:

Figure 2 Partition log

Each partition is an ordered, immutable sequence of records that is continually appended to, forming a structured commit log. Each record in a partition is assigned a sequential ID number called the offset, which uniquely identifies the record within that partition.
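
As a rough illustration of how offsets behave, the sketch below models a partition as an append-only list whose indexes serve as offsets. The Partition and Topic classes are hypothetical and say nothing about how Kafka stores data on disk.

```python
class Partition:
    """Toy append-only log; a record's offset is its position in the list."""

    def __init__(self):
        self._records = []

    def append(self, record):
        # The new record receives the next sequential offset.
        offset = len(self._records)
        self._records.append(record)
        return offset

    def read(self, offset):
        # An offset identifies exactly one record within this partition.
        if not 0 <= offset < len(self._records):
            raise IndexError(f"offset {offset} is out of range")
        return self._records[offset]


class Topic:
    """Toy topic: a fixed number of independent partitions."""

    def __init__(self, num_partitions):
        self.partitions = [Partition() for _ in range(num_partitions)]


topic = Topic(num_partitions=2)
first = topic.partitions[0].append("a")   # offset 0 in partition 0
second = topic.partitions[0].append("b")  # offset 1 in partition 0
other = topic.partitions[1].append("c")   # offset 0 in partition 1
```

Offset 0 exists in both partitions here, which is why an offset identifies a record only together with its partition.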

Generally, to maximize disk performance, the total number of partitions should be greater than or equal to the number of disks, so that each disk holds at least one partition. Otherwise, some disks may remain unused.
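
The effect of the partition count on disk usage can be shown with a small calculation. The sketch assumes that each new partition is placed on the disk that currently holds the fewest partitions; this placement rule and the function names are assumptions made for illustration.

```python
def assign_partitions(num_partitions, num_disks):
    """Place each partition on the disk that currently holds the fewest partitions."""
    disks = [[] for _ in range(num_disks)]
    for partition in range(num_partitions):
        # Ties are broken in favor of the lowest-numbered disk.
        target = min(range(num_disks), key=lambda d: len(disks[d]))
        disks[target].append(partition)
    return disks


def unused_disks(num_partitions, num_disks):
    """Count the disks that end up holding no partition at all."""
    layout = assign_partitions(num_partitions, num_disks)
    return sum(1 for held in layout if not held)
```

Under this assumption, 2 partitions on 4 disks leave 2 disks idle, whereas 4 or more partitions on 4 disks leave none idle.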

The Kafka cluster durably persists all published records, whether or not they have been consumed, for a configurable retention period. For example, if the retention period is set to two days, a record is available for consumption for two days after it is published, after which it is discarded to free up space. Kafka's performance is effectively constant with respect to data size, so storing data for a long time is not a problem.
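
The two-day example can be sketched as a time-based cleanup over a toy log. The RetainedLog class is hypothetical, timestamps are plain seconds supplied by the caller, and expiry is applied record by record for simplicity; this is not a description of how a broker implements retention internally.

```python
RETENTION_SECONDS = 2 * 24 * 60 * 60  # two days, as in the example above


class RetainedLog:
    """Toy log that discards records once they reach the retention period."""

    def __init__(self, retention_seconds=RETENTION_SECONDS):
        self.retention_seconds = retention_seconds
        self._records = []  # list of (publish_time, value) pairs

    def publish(self, value, now):
        self._records.append((now, value))

    def expire(self, now):
        # Records are dropped purely by age, whether or not anyone consumed them.
        cutoff = now - self.retention_seconds
        self._records = [(t, v) for t, v in self._records if t > cutoff]

    def available(self):
        return [v for _, v in self._records]


log = RetainedLog()
log.publish("old", now=0)
log.publish("new", now=100_000)
log.expire(now=RETENTION_SECONDS + 1)  # "old" is now past the two-day limit
remaining = log.available()
```

After the cleanup only the newer record remains available, even though neither record was ever read.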
