Open Source MPP Greenplum Reference Architectures

Scenarios

The open source Greenplum is a typical OLAP system based on the MPP architecture and is developed based on the open source PostgreSQL.

OLAP has the following characteristics:

No data is generated. The basic data comes from the production system.
It is an analytical system based on query.
Complex queries often involve multi-table joins and full table scans.
The response time is closely related to the specific query.
The number of users is small.
Distributed clusters with high scalability are involved.

Architectures and Principles

Figure 1 Typical component architecture

Primary node: functions as the control center of the entire system and an external service access point. It receives SQL requests from users, generates query plans based on the SQL statements, performs parallel processing optimization, dispatches the query plans to all segment nodes for processing, coordinates the segment nodes to perform parallel processing step by step according to the query plan, obtains the segment processing result, and returns the result to the client. From the users, only the primary node in the Greenplum cluster is visible. All parallel processing is automatically completed under the control of the primary node. Generally, there are only one or two primary nodes (backup for each other).
Segment DB node: performs parallel computing in the Greenplum cluster. It receives instructions from the primary node and performs massively parallel processing (MPP). Therefore, the total computing performance of all segment nodes is the performance of the entire cluster. Adding segment nodes can linearly improve the processing performance and storage capacity of the cluster. A cluster supports a maximum of 10,000 segment nodes.
Interconnect: transfers data between the primary and segment nodes or between segment nodes. GE or 10GE switches are used to implement high-speed data transmission between nodes.
Greenplum database product uses MPP techniques. External data is loaded to the segment nodes directly through parallel data streams, which ensures that external data is loaded to the database in the shortest time.

The primary and standby nodes are deployed on different servers to improve HA.
Two-copy redundancy is used between Segment DB nodes. More than four Segment DB nodes are recommended on each server to improve CPU utilization.

Parent topic: Architecture