Tuning Database Segments
Table Distribution
Select a proper distribution key. A distribution key is a unique index in a table and can be one or more columns. In ideal cases, data can be evenly distributed on each segment in a single column. A proper distribution key should have the same number of records on each segment.
Guidelines for selecting a distribution key:
- Select columns that require JOIN frequently to reduce the overhead caused by redistribution or broadcasting of small tables.
- Select an evenly distributed column or multiple columns to reduce the risk that the time is consumed by a single segment.
- Do not use a column as the distribution key and partition key at the same time.
- If no proper distribution key is available, select random distribution and select small tables similar to dimension tables. Do not select random distribution for large tables similar to fact tables.
Table Partitioning
A proper partitioning policy enables that only related data is accessed. Data distribution is to physically split table data and distribute data to each node so that data can be concurrently computed. This is mandatory in Greenplum. Table partitioning logically splits data in large tables to improve query performance and facilitate data lifecycle management. Partitioning and distribution can be used together to improve data performance.
Guidelines for designing partitions:
- Determine whether the data volume of a table is large: Partitioning can be used for a large fact table, for example, a table with tens of millions or hundreds of millions of data records. However, there is no absolute standard to determine the data volume. You can decide whether to use table partitioning based on experience or the current system performance.
- Determine whether a table has a proper partition key: If the data volume is large enough, check whether there is a proper key that can be used for partitioning. A column that has the time dimension, such as day or month, is an ideal partition key.
- Determine whether data in a table has a life cycle: Generally, data in a data warehouse has a certain life cycle, for example, one year. This involves the management of old data. If the table is partitioned, old data will be deleted or archived to cheaper storage media such as object storage.
- Determine whether a query statement contains the partition key: If a table is partitioned but all queries do not contain the partition key, the performance cannot be improved but deteriorates because these queries scan all partition tables.
Table Compression and Column-Store
Greenplum supports both row-based and column-based storage. Row-based storage is suitable for typical OLTP scenarios, while column-based storage is suitable for OLAP scenarios, especially for large-scale aggregation analysis by column. Column-store compression reduces I/O access.