Storm Overview
Storm Basic Components
- Nimbus
Distributes code across the cluster, assigns tasks to machines, and monitors their status. There is only one Nimbus component in a cluster.
- ZooKeeper
Vital external resource that Storm depends on. Components such as Nimbus, supervisors, and workers store heartbeat data on ZooKeeper. Nimbus schedules and allocates tasks based on the heartbeat and task running status on ZooKeeper.
- Supervisor
Monitors the tasks allocated to its node and starts or stops worker processes as required. Each host running Storm runs one supervisor.
- Worker
A JVM instance created on a supervisor. Executors run in a worker and serve as the container of tasks.
- Executor
Container for running tasks. The task processing logic is executed in executors. One or more executor instances can run in the same worker process, and one or more tasks can run in the same executor.
- Task
Each running instance of a spout or bolt is called a task. One spout or bolt may correspond to one or more spout tasks or bolt tasks.
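The worker/executor/task hierarchy above can be sketched in plain Java: one JVM (the worker) hosts a thread pool (the executors), and each thread drives several task instances. This is an illustrative model of the relationship, not Storm's API; all class and method names here are made up for the sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

public class ExecutorTaskSketch {
    // One "task": an instance of the processing logic, counting tuples it sees.
    static class CountTask {
        final AtomicInteger processed = new AtomicInteger();
        void execute(String tuple) { processed.incrementAndGet(); }
    }

    // Returns the total number of tuples processed across all tasks.
    static int run() {
        // This JVM plays the role of one worker hosting two executors
        // (threads); each executor drives two task instances.
        ExecutorService worker = Executors.newFixedThreadPool(2);
        List<CountTask> tasks = new ArrayList<>();
        for (int i = 0; i < 4; i++) tasks.add(new CountTask());

        List<Future<?>> done = new ArrayList<>();
        for (int e = 0; e < 2; e++) {
            final int offset = e * 2;
            done.add(worker.submit(() -> {
                // Each executor thread feeds its own two tasks.
                for (String tuple : new String[]{"a", "b", "c"}) {
                    tasks.get(offset).execute(tuple);
                    tasks.get(offset + 1).execute(tuple);
                }
            }));
        }
        try {
            for (Future<?> f : done) f.get();
        } catch (Exception ex) {
            throw new RuntimeException(ex);
        }
        worker.shutdown();

        int total = 0;
        for (CountTask t : tasks) total += t.processed.get();
        return total; // 4 tasks x 3 tuples each = 12
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

In real Storm, the number of workers, executors, and tasks per component is set when the topology is built; the sketch only shows how the three levels nest.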
Storm Stream Processing Flow
In Storm, data can be processed by connecting components in series or by combining multiple types of stream operations.
- Connect Storm components in series.
This is the simplest and most intuitive method: connect Storm components (spouts or bolts) one after another. You only need to understand the basic methods for writing these components.
- Combine multiple types of stream operations of Storm.
- Storm supports stream aggregation: data from multiple components can be aggregated into the same processing component for unified processing. Multiple spouts can be aggregated into bolts (many-to-one or many-to-many), and multiple bolts can be aggregated into other bolts (many-to-one or many-to-many).
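The two composition styles above can be sketched together in plain Java: two "spouts" are aggregated into one "bolt" (many-to-one), whose output is then chained serially into a second "bolt". This is a conceptual sketch, not Storm's API; all names are illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CompositionSketch {
    // Two "spouts", each emitting sentence tuples.
    static List<String> spoutA() { return Arrays.asList("hello storm"); }
    static List<String> spoutB() { return Arrays.asList("stream processing"); }

    // Aggregating "bolt" (many-to-one): merges tuples from several upstream
    // components and splits each sentence into words.
    static List<String> splitBolt(List<List<String>> inputs) {
        List<String> words = new ArrayList<>();
        for (List<String> input : inputs)
            for (String sentence : input)
                words.addAll(Arrays.asList(sentence.split(" ")));
        return words;
    }

    // Serially chained "bolt": upper-cases each word from the previous bolt.
    static List<String> upperBolt(List<String> in) {
        List<String> out = new ArrayList<>();
        for (String w : in) out.add(w.toUpperCase());
        return out;
    }

    static List<String> run() {
        // spoutA + spoutB -> splitBolt (aggregation) -> upperBolt (serial)
        return upperBolt(splitBolt(Arrays.asList(spoutA(), spoutB())));
    }

    public static void main(String[] args) {
        System.out.println(run()); // [HELLO, STORM, STREAM, PROCESSING]
    }
}
```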
Basic Concepts of Storm
A topology is a program submitted and run by Storm. The minimum message unit processed by a topology is a tuple, which is an array of any objects. A topology consists of spouts and bolts. A spout is a node that sends tuples. Bolts can subscribe to tuples sent by spouts or bolts. Figure 1 shows an example of the logical diagram of a topology.
- Topology
A topology is an object used to orchestrate and contain a group of computing logic components (spouts and bolts). The computing components can be orchestrated in a directed acyclic graph (DAG) to form an object with more complex computing logic. After a topology is started, it runs continuously until it is stopped manually (for example, by explicitly running bin/storm kill) or by an unexpected fault (such as shutdown or suspension of the entire Storm cluster).
- Spout
A spout is a message source of a topology. It continuously produces messages. In Storm, messages produced by spouts are abstracted as tuples. Multiple computing components of a topology are connected by streams of abstract tuple messages.
- Bolt
The message processing logic of Storm is encapsulated into bolts. Any processing logic can be executed in bolts. Bolts can receive tuples from one or more spouts, from multiple other bolts, or from a combination of spouts and bolts.
- Stream Grouping
In Storm, stream grouping defines how streams are connected, grouped, and distributed between computing components (spouts and bolts). Storm defines the following grouping strategies: shuffle grouping, fields grouping (grouping by fields), all grouping (broadcast grouping), global grouping, none grouping, direct grouping, and local-or-shuffle grouping.
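Fields grouping is the strategy that makes stateful bolts such as word counters work: tuples with the same field value are always routed to the same downstream task, so each task sees the complete count for its words. A common way to realize this is hashing the grouping field modulo the task count; the sketch below models that in plain Java (illustrative names, not Storm's API).

```java
import java.util.HashMap;
import java.util.Map;

public class FieldsGroupingSketch {
    // One "task" of a word-count bolt, keeping its own partial state.
    static class CountTask {
        final Map<String, Integer> counts = new HashMap<>();
        void execute(String word) { counts.merge(word, 1, Integer::sum); }
    }

    // Fields grouping: hash the grouping field modulo the number of tasks,
    // so equal field values always pick the same task.
    static int targetTask(String fieldValue, int numTasks) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    // Routes each word to its task, then returns the final count of `word`
    // held by the task that owns it.
    static int countAfterRouting(String[] words, int numTasks, String word) {
        CountTask[] tasks = new CountTask[numTasks];
        for (int i = 0; i < numTasks; i++) tasks[i] = new CountTask();
        for (String w : words) tasks[targetTask(w, numTasks)].execute(w);
        return tasks[targetTask(word, numTasks)].counts.getOrDefault(word, 0);
    }

    public static void main(String[] args) {
        String[] words = {"a", "b", "a", "c", "a"};
        // All three "a" tuples land on the same task, so its count is 3.
        System.out.println(countAfterRouting(words, 3, "a")); // 3
    }
}
```

Shuffle grouping, by contrast, would spread the "a" tuples across tasks at random, which is fine for stateless bolts but would split the counts here.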
Storm Topology Submission Process
- The client uploads JAR packages to the Inbox directory of Nimbus through the Nimbus interface. After the upload is complete, a topology is submitted to Nimbus.
- After receiving the command for submitting the topology, Nimbus processes the received JAR package and sets the topology's static information. It then allocates tasks to nodes based on the heartbeat information on ZooKeeper, distributes the tasks evenly across the configured number of workers, and determines the supervisors on which those workers run.
- After tasks are allocated, Nimbus submits the task information to the ZooKeeper cluster. The ZooKeeper cluster also contains worker nodes that store the heartbeat information of all worker processes of the current topology.
- The supervisor nodes continuously poll the ZooKeeper cluster. The assignment nodes in ZooKeeper store all topology task assignment information, code storage directories, and relationships between tasks. A supervisor polls the assignment nodes to obtain its own tasks and starts the corresponding worker processes.
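The assignment step above can be sketched in plain Java: a map standing in for the assignment node in ZooKeeper records which supervisor runs which tasks, and each supervisor reads it to decide which worker processes to start. This is a simplified model of the flow, not the real ZooKeeper data layout; all names are illustrative.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AssignmentSketch {
    // task id -> supervisor id, standing in for the assignment node
    // that Nimbus writes to ZooKeeper.
    static Map<Integer, String> assignmentNode() {
        Map<Integer, String> assign = new HashMap<>();
        assign.put(1, "supervisor-a");
        assign.put(2, "supervisor-b");
        assign.put(3, "supervisor-a");
        return assign;
    }

    // A supervisor polls the assignment map and keeps only its own tasks,
    // for which it would then start worker processes.
    static List<Integer> tasksFor(String supervisorId) {
        List<Integer> mine = new ArrayList<>();
        for (Map.Entry<Integer, String> e : assignmentNode().entrySet()) {
            if (e.getValue().equals(supervisorId)) mine.add(e.getKey());
        }
        Collections.sort(mine);
        return mine;
    }

    public static void main(String[] args) {
        System.out.println(tasksFor("supervisor-a")); // tasks 1 and 3
    }
}
```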
After a topology starts running, spouts continuously send streams, which are continuously processed by bolts.
