Using Materialized Views on the ClickHouse Engine
Materialized views can be used on the ClickHouse engine in the following five scenarios:
Scenario 1: Obtain SQL statements from the ClickHouse execution log and parse them directly to obtain recommended projections. Parsing and recommendation do not consume resources on the ClickHouse server.
Configuration file in the distributed table scenario:
database = ssb  # Target database
work_path = ./clickhouse  # Path for storing working files such as materialized view SQL files, view files, and model files
ck_host = 0.0.0.0  # ClickHouse service IP address
ck_port = 9000  # ClickHouse service port
cluster_name = mycluster  # Cluster name
projection_wait_time = 3600  # Timeout interval for creating a projection
is_distributed_table = True  # Type of the ClickHouse table. False indicates a local table; True indicates a distributed table.
clickhouse_tables = lineorder_flat_all  # Target table. Use commas (,) to separate multiple tables. The tables must be in the target database.
query_log_table = system.query_log  # Log table from which SQL statements are extracted
sql_running_time_threshold = 100  # SQL execution time threshold, in ms. SQL statements whose execution time is less than the threshold are not processed.
query_limit_each_table = 15  # Maximum number of SQL statements that can be analyzed for each target table
projection_limit_each_table = 3  # Number of projections created for each table
projection_with_conditional_functions = False  # Whether to recommend projections for aggregate functions that contain conditions
allows_materialize_projection = True  # Whether to materialize the historical data of a new projection
Configuration file in the local table scenario:
database = ssb  # Target database
work_path = ./clickhouse  # Path for storing working files such as materialized view SQL files, view files, and model files
ck_host = 0.0.0.0  # ClickHouse service IP address
ck_port = 9000  # ClickHouse service port
projection_wait_time = 3600  # Timeout interval for creating a projection
is_distributed_table = False  # Type of the ClickHouse table. False indicates a local table; True indicates a distributed table.
clickhouse_tables = lineorder_flat  # Target table. Use commas (,) to separate multiple tables. The tables must be in the target database.
query_log_table = system.query_log  # Log table from which SQL statements are extracted
sql_running_time_threshold = 100  # SQL execution time threshold, in ms. SQL statements whose execution time is less than the threshold are not processed.
query_limit_each_table = 15  # Maximum number of SQL statements that can be analyzed for each target table
projection_limit_each_table = 3  # Number of projections created for each table
projection_with_conditional_functions = False  # Whether to recommend projections for aggregate functions that contain conditions
allows_materialize_projection = True  # Whether to materialize the historical data of a new projection
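The configuration files above use a flat "key = value  # comment" layout. As an illustrative sketch only, such a file can be loaded and sanity-checked with Python's standard configparser; the section name "omnimv" below is an assumption made here because configparser requires a section header, not part of the tool.

```python
# Sketch: load a flat OmniMV-style config with configparser.
# Key names come from the documentation above; "omnimv" is a dummy section.
import configparser

def load_omnimv_config(text: str) -> dict:
    parser = configparser.ConfigParser(inline_comment_prefixes=("#",))
    parser.read_string("[omnimv]\n" + text)  # wrap flat keys in a dummy section
    cfg = dict(parser["omnimv"])
    # Basic sanity checks on a few documented keys.
    assert cfg["is_distributed_table"] in ("True", "False")
    assert int(cfg["ck_port"]) > 0
    return cfg

sample = """
database = ssb
ck_port = 9000  # ClickHouse service port
is_distributed_table = False
clickhouse_tables = lineorder_flat
"""
cfg = load_omnimv_config(sample)
print(cfg["database"])  # -> ssb
```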
Procedure:
- Access the installation directory of OmniMV.
- Initialize the database information.
python main.pyc clickhouse init_database
- Extract SQL statements that contain the target table query from query_log.
python main.pyc clickhouse extract_sqls
- Parse the extracted SQL statements to obtain the recommended projection.
python main.pyc clickhouse parse_sql_text
- Create the recommended projections in the target table.
python main.pyc clickhouse create_projections
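The exact query that the extract_sqls step runs is internal to OmniMV, but the parameters above imply a filter over the query log of roughly the following shape. This is a sketch, not the tool's actual implementation; `type`, `query`, and `query_duration_ms` are real columns of ClickHouse's system.query_log table.

```python
# Sketch of the kind of system.query_log filter implied by the configuration:
# finished statements touching the target table, slower than
# sql_running_time_threshold, capped at query_limit_each_table rows.
def build_query_log_sql(target_table: str,
                        query_log_table: str = "system.query_log",
                        threshold_ms: int = 100,
                        limit: int = 15) -> str:
    return (
        f"SELECT query FROM {query_log_table} "
        f"WHERE type = 'QueryFinish' "
        f"AND query_duration_ms >= {threshold_ms} "
        f"AND query ILIKE '%{target_table}%' "
        f"LIMIT {limit}"
    )

print(build_query_log_sql("lineorder_flat"))
```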
Scenario 2: Obtain SQL statements from the ClickHouse execution log and parse them to obtain recommended projections. The tool must connect to the database to query the compression ratio, and the greedy algorithm is used for recommendation. Parsing and recommendation consume a large amount of resources on the ClickHouse server.
Configuration file in the distributed table scenario:
database = ssb  # Target database
work_path = ./clickhouse  # Path for storing working files such as materialized view SQL files, view files, and model files
ck_host = 0.0.0.0  # ClickHouse service IP address
ck_port = 9000  # ClickHouse service port
cluster_name = mycluster  # Cluster name
projection_wait_time = 3600  # Timeout interval for creating a projection
projection_compress_ratio = 0.8  # Compression ratio threshold. Projections whose data compression ratio is greater than the threshold are not recommended.
is_distributed_table = True  # Type of the ClickHouse table. False indicates a local table; True indicates a distributed table.
clickhouse_tables = lineorder_flat_all  # Target table. Use commas (,) to separate multiple tables. The tables must be in the target database.
query_log_table = system.query_log  # Log table from which SQL statements are extracted
sql_running_time_threshold = 100  # SQL execution time threshold, in ms. SQL statements whose execution time is less than the threshold are not processed.
query_limit_each_table = 15  # Maximum number of SQL statements that can be analyzed for each target table
projection_limit_each_table = 3  # Number of projections created for each table
projection_with_conditional_functions = False  # Whether to recommend projections for aggregate functions that contain conditions
allows_materialize_projection = True  # Whether to materialize the historical data of a new projection
candidate_strategy = compress_ratio  # compress_ratio: sort candidates by projection compression ratio; sql: sort by the number of SQL statements that each projection can match
Configuration file in the local table scenario:
database = ssb  # Target database
work_path = ./clickhouse  # Path for storing working files such as materialized view SQL files, view files, and model files
ck_host = 0.0.0.0  # ClickHouse service IP address
ck_port = 9000  # ClickHouse service port
projection_wait_time = 3600  # Timeout interval for creating a projection
projection_compress_ratio = 0.8  # Compression ratio threshold. Projections whose data compression ratio is greater than the threshold are not recommended.
is_distributed_table = False  # Type of the ClickHouse table. False indicates a local table; True indicates a distributed table.
clickhouse_tables = lineorder_flat  # Target table. Use commas (,) to separate multiple tables. The tables must be in the target database.
query_log_table = system.query_log  # Log table from which SQL statements are extracted
sql_running_time_threshold = 100  # SQL execution time threshold, in ms. SQL statements whose execution time is less than the threshold are not processed.
query_limit_each_table = 15  # Maximum number of SQL statements that can be analyzed for each target table
projection_limit_each_table = 3  # Number of projections created for each table
projection_with_conditional_functions = False  # Whether to recommend projections for aggregate functions that contain conditions
allows_materialize_projection = True  # Whether to materialize the historical data of a new projection
candidate_strategy = compress_ratio  # compress_ratio: sort candidates by projection compression ratio; sql: sort by the number of SQL statements that each projection can match
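The greedy selection described above can be sketched as follows, under the assumption that each candidate projection carries a measured compression ratio and the number of logged SQL statements it can serve. Candidates above projection_compress_ratio are discarded; the rest are ordered according to candidate_strategy, and the top projection_limit_each_table are kept. This is an illustration of the policy, not OmniMV's actual code.

```python
# Minimal greedy candidate selection: filter by compression-ratio threshold,
# sort by the configured strategy, keep the top `limit` candidates.
def pick_projections(candidates, compress_ratio=0.8,
                     strategy="compress_ratio", limit=3):
    viable = [c for c in candidates if c["ratio"] <= compress_ratio]
    if strategy == "compress_ratio":
        viable.sort(key=lambda c: c["ratio"])  # smaller ratio is better
    else:  # "sql": prefer projections that match more logged statements
        viable.sort(key=lambda c: c["matched_sqls"], reverse=True)
    return viable[:limit]

candidates = [
    {"name": "p1", "ratio": 0.30, "matched_sqls": 5},
    {"name": "p2", "ratio": 0.95, "matched_sqls": 9},  # dropped: ratio too high
    {"name": "p3", "ratio": 0.55, "matched_sqls": 7},
]
print([c["name"] for c in pick_projections(candidates)])  # -> ['p1', 'p3']
```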
Procedure:
- Access the installation directory of OmniMV.
- Initialize the database information.
python main.pyc clickhouse init_database
- Extract SQL statements that contain the target table query from query_log.
python main.pyc clickhouse extract_sqls
- Parse the extracted SQL statements to obtain the recommended projection.
python main.pyc clickhouse generate_views
- Create the recommended projections in the target table.
python main.pyc clickhouse create_projections
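For each recommendation, the create_projections step ultimately issues projection DDL against the target table. ADD PROJECTION and MATERIALIZE PROJECTION are standard ClickHouse statements; the table, projection name, and column list below are placeholders for illustration, not actual OmniMV output.

```python
# Hypothetical illustration of the DDL behind the create_projections step.
def projection_ddl(table: str, name: str, select_body: str):
    add = f"ALTER TABLE {table} ADD PROJECTION {name} ({select_body})"
    # With allows_materialize_projection = True, historical parts are also rebuilt:
    materialize = f"ALTER TABLE {table} MATERIALIZE PROJECTION {name}"
    return add, materialize

add, mat = projection_ddl(
    "lineorder_flat", "proj_rev_by_date",
    "SELECT LO_ORDERDATE, sum(LO_REVENUE) GROUP BY LO_ORDERDATE",
)
print(add)
print(mat)
```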
Scenario 3: Obtain SQL statements from the ClickHouse execution log and parse them to obtain recommended projections. The recommendation algorithm uses an AI model, and model training consumes a large amount of resources on the ClickHouse server.
Configuration file in the distributed table scenario:
database = ssb
work_path = ./clickhouse
ck_host = 0.0.0.0
ck_port = 19000
cluster_name = mycluster
num_round = 5
projection_wait_time = 3600
projection_compress_ratio = 0.8
sql_running_time_threshold = 100
projection_limit_each_table = 5
is_distributed_table = True
clickhouse_tables = lineorder_flat_all
query_log_table = system.query_log
query_limit_each_table = 15
projection_with_conditional_functions = False
allows_materialize_projection = True
max_data_count = 20
epochs = 100
estimators = 50
max_depth = 5
learning_rate = 0.05
candidate_strategy = compress_ratio
Configuration file in the local table scenario:
database = ssb
work_path = ./clickhouse
ck_host = 0.0.0.0
ck_port = 19000
num_round = 5
projection_wait_time = 3600
projection_compress_ratio = 0.8
sql_running_time_threshold = 100
projection_limit_each_table = 5
is_distributed_table = False
clickhouse_tables = lineorder_flat
query_log_table = system.query_log
query_limit_each_table = 15
projection_with_conditional_functions = False
allows_materialize_projection = True
max_data_count = 20
epochs = 100
estimators = 50
max_depth = 5
learning_rate = 0.05
candidate_strategy = compress_ratio
Procedure:
- Access the installation directory of OmniMV.
- Initialize the database information.
python main.pyc clickhouse init_database
- Extract SQL statements that contain the target table query from query_log.
python main.pyc clickhouse extract_sqls
- Parse the extracted SQL statements to obtain the recommended projection.
python main.pyc clickhouse generate_views
- Train the cost estimation model.
python main.pyc clickhouse cost_estimation
- Train the recommendation model.
python main.pyc clickhouse recommend
- Create the recommended projections in the target table.
python main.pyc clickhouse create_projections_model
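The epochs and learning_rate keys in the Scenario 3 configuration are typical training hyperparameters for the cost estimation model. OmniMV's actual model is internal; as a purely illustrative stand-in, the sketch below shows how such hyperparameters steer a simple gradient-descent fit of query cost against one feature.

```python
# Purely illustrative: a one-feature least-squares fit driven by the epochs
# and learning_rate values from the configuration above. This is NOT OmniMV's
# cost model; it only shows the role these hyperparameters usually play.
def fit_cost_model(xs, ys, epochs=100, learning_rate=0.05):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Toy data: cost is roughly 2*x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = fit_cost_model(xs, ys)
print(round(w, 1), round(b, 1))  # -> 2.0 1.0
```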
Scenario 4: Obtain new SQL statements from the ClickHouse execution log, parse them to obtain projections, and use the trained AI model to recommend projections. Parsing and recommendation do not consume resources on the ClickHouse server.
Configuration file in the distributed table scenario:
database = ssb
work_path = ./clickhouse
ck_host = 0.0.0.0
ck_port = 19000
cluster_name = mycluster
projection_wait_time = 3600
sql_running_time_threshold = 100
projection_limit_each_table = 5
is_distributed_table = True
clickhouse_tables = lineorder_flat_all
query_log_table = system.query_log
query_limit_each_table = 15
projection_with_conditional_functions = False
allows_materialize_projection = True
candidate_strategy = compress_ratio
Configuration file in the local table scenario:
database = ssb
work_path = ./clickhouse
ck_host = 0.0.0.0
ck_port = 19000
projection_wait_time = 3600
sql_running_time_threshold = 100
projection_limit_each_table = 5
is_distributed_table = False
clickhouse_tables = lineorder_flat
query_log_table = system.query_log
query_limit_each_table = 15
projection_with_conditional_functions = False
allows_materialize_projection = True
candidate_strategy = compress_ratio
Procedure:
- Access the installation directory of OmniMV.
- Extract new SQL statements.
python main.pyc clickhouse extract_sqls_incremental
- Parse the new SQL statements to obtain projections.
python main.pyc clickhouse parse_sql_text_incremental
- Use the trained model for direct recommendation.
python main.pyc clickhouse recommend
- Optional: Use the new SQL statements to incrementally train the AI model. This consumes a large amount of resources on the ClickHouse server.
python main.pyc clickhouse cost_estimation_incremental
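The incremental idea behind extract_sqls_incremental is to pull only query_log entries newer than some stored watermark, so already-analyzed SQL is not reprocessed. The watermark handling sketched below is an assumption made for illustration, not OmniMV's actual mechanism; `event_time` is a real column of system.query_log.

```python
# Sketch: fetch only query_log entries newer than the last processed timestamp.
from datetime import datetime

def build_incremental_sql(last_run: datetime, threshold_ms: int = 100) -> str:
    return (
        "SELECT query FROM system.query_log "
        "WHERE type = 'QueryFinish' "
        f"AND query_duration_ms >= {threshold_ms} "
        f"AND event_time > '{last_run:%Y-%m-%d %H:%M:%S}'"
    )

print(build_incremental_sql(datetime(2024, 1, 1)))
```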
Scenario 5: For multi-table query SQL statements, materialized views are obtained directly after parsing. No recommendation policy is applied.
Configuration file in the multi-table scenario:
database = imdbload
work_path = ./clickhouse
ck_host = 0.0.0.0
ck_port = 19000
clickhouse_tables = all
query_log_table = system.query_log
query_limit_mutil_table = 15
Procedure:
- Access the installation directory of OmniMV.
- Initialize the database information.
python main.pyc clickhouse init_database
- Obtain the multi-table query SQL statements and generate materialized views after parsing.
python main.pyc clickhouse generate_materialize_view
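Scenario 5 only targets queries that reference more than one table. A simple heuristic for spotting such statements is to count the distinct names that follow FROM/JOIN keywords; OmniMV uses real SQL parsing, so the regex below is only a sketch, and the table names in the examples are placeholders.

```python
# Illustrative heuristic: a query is "multi-table" when more than one distinct
# table name appears after FROM or JOIN. Real SQL parsing is more robust.
import re

def is_multi_table(sql: str) -> bool:
    tables = re.findall(r"\b(?:FROM|JOIN)\s+([\w.]+)", sql, flags=re.IGNORECASE)
    return len(set(tables)) > 1

print(is_multi_table(
    "SELECT * FROM title t JOIN cast_info ci ON t.id = ci.movie_id"))  # -> True
print(is_multi_table("SELECT count() FROM lineorder_flat"))            # -> False
```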