Using Materialized Views on the ClickHouse Engine
Materialized views can be used on the ClickHouse engine in the following five scenarios:
Scenario 1: Obtain SQL statements from the ClickHouse execution log and parse them directly to obtain recommended projections. Parsing and recommendation do not consume resources on the ClickHouse server.
Configuration file in the distributed table scenario:
database = ssb  # Target database
work_path = ./clickhouse  # Path for storing working files such as materialized view SQL files, view files, and model files
ck_host = 0.0.0.0  # ClickHouse service IP address
ck_port = 9000  # ClickHouse service port
cluster_name = mycluster  # Cluster name
projection_wait_time = 3600  # Timeout interval for creating a projection
is_distributed_table = True  # Type of the ClickHouse table. False indicates a local table; True indicates a distributed table.
clickhouse_tables = lineorder_flat_all  # Target table. Use commas (,) to separate multiple tables. The tables must be in the target database.
query_log_table = system.query_log  # Log table from which SQL statements are extracted
sql_running_time_threshold = 100  # SQL execution time threshold, in ms. SQL statements whose execution time is less than the threshold are not processed.
query_limit_each_table = 15  # Maximum number of SQL statements that can be analyzed for each target table
projection_limit_each_table = 3  # Number of projections created for each table
projection_with_conditional_functions = False  # Whether to recommend projections for aggregate functions that contain conditions
allows_materialize_projection = True  # Whether to materialize the historical data of a new projection
Configuration file in the local table scenario:
database = ssb  # Target database
work_path = ./clickhouse  # Path for storing working files such as materialized view SQL files, view files, and model files
ck_host = 0.0.0.0  # ClickHouse service IP address
ck_port = 9000  # ClickHouse service port
projection_wait_time = 3600  # Timeout interval for creating a projection
is_distributed_table = False  # Type of the ClickHouse table. False indicates a local table; True indicates a distributed table.
clickhouse_tables = lineorder_flat  # Target table. Use commas (,) to separate multiple tables. The tables must be in the target database.
query_log_table = system.query_log  # Log table from which SQL statements are extracted
sql_running_time_threshold = 100  # SQL execution time threshold, in ms. SQL statements whose execution time is less than the threshold are not processed.
query_limit_each_table = 15  # Maximum number of SQL statements that can be analyzed for each target table
projection_limit_each_table = 3  # Number of projections created for each table
projection_with_conditional_functions = False  # Whether to recommend projections for aggregate functions that contain conditions
allows_materialize_projection = True  # Whether to materialize the historical data of a new projection
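The configuration files above use a flat "key = value  # comment" layout. As an illustrative sketch only, such a file can be loaded and sanity-checked with Python's standard configparser; the section name "omnimv" below is an assumption made here because configparser requires a section header, not part of the tool.

```python
# Sketch: load a flat OmniMV-style config with configparser.
# Key names come from the documentation above; "omnimv" is a dummy section.
import configparser

def load_omnimv_config(text: str) -> dict:
    parser = configparser.ConfigParser(inline_comment_prefixes=("#",))
    parser.read_string("[omnimv]\n" + text)  # wrap flat keys in a dummy section
    cfg = dict(parser["omnimv"])
    # Basic sanity checks on a few documented keys.
    assert cfg["is_distributed_table"] in ("True", "False")
    assert int(cfg["ck_port"]) > 0
    return cfg

sample = """
database = ssb
ck_port = 9000  # ClickHouse service port
is_distributed_table = False
clickhouse_tables = lineorder_flat
"""
cfg = load_omnimv_config(sample)
print(cfg["database"])  # -> ssb
```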
Procedure:
- Access the installation directory of OmniMV.
- Initialize the database information.
python main.pyc clickhouse init_database
- Extract SQL statements that contain the target table query from query_log.
python main.pyc clickhouse extract_sqls
- Parse the extracted SQL statements to obtain the recommended projection.
python main.pyc clickhouse parse_sql_text
- Create the recommended projections in the target table.
python main.pyc clickhouse create_projections
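The exact query that the extract_sqls step runs is internal to OmniMV, but the parameters above imply a filter over the query log of roughly the following shape. This is a sketch, not the tool's actual implementation; `type`, `query`, and `query_duration_ms` are real columns of ClickHouse's system.query_log table.

```python
# Sketch of the kind of system.query_log filter implied by the configuration:
# finished statements touching the target table, slower than
# sql_running_time_threshold, capped at query_limit_each_table rows.
def build_query_log_sql(target_table: str,
                        query_log_table: str = "system.query_log",
                        threshold_ms: int = 100,
                        limit: int = 15) -> str:
    return (
        f"SELECT query FROM {query_log_table} "
        f"WHERE type = 'QueryFinish' "
        f"AND query_duration_ms >= {threshold_ms} "
        f"AND query ILIKE '%{target_table}%' "
        f"LIMIT {limit}"
    )

print(build_query_log_sql("lineorder_flat"))
```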
Scenario 2: Obtain SQL statements from the ClickHouse execution log and parse them to obtain recommended projections. The tool must connect to the database to query the compression ratio, and the greedy algorithm is used for recommendation. Parsing and recommendation consume a large amount of resources on the ClickHouse server.
Configuration file in the distributed table scenario:
database = ssb  # Target database
work_path = ./clickhouse  # Path for storing working files such as materialized view SQL files, view files, and model files
ck_host = 0.0.0.0  # ClickHouse service IP address
ck_port = 9000  # ClickHouse service port
cluster_name = mycluster  # Cluster name
projection_wait_time = 3600  # Timeout interval for creating a projection
projection_compress_ratio = 0.8  # Compression ratio threshold. Projections whose data compression ratio is greater than the threshold are not recommended.
is_distributed_table = True  # Type of the ClickHouse table. False indicates a local table; True indicates a distributed table.
clickhouse_tables = lineorder_flat_all  # Target table. Use commas (,) to separate multiple tables. The tables must be in the target database.
query_log_table = system.query_log  # Log table from which SQL statements are extracted
sql_running_time_threshold = 100  # SQL execution time threshold, in ms. SQL statements whose execution time is less than the threshold are not processed.
query_limit_each_table = 15  # Maximum number of SQL statements that can be analyzed for each target table
projection_limit_each_table = 3  # Number of projections created for each table
projection_with_conditional_functions = False  # Whether to recommend projections for aggregate functions that contain conditions
allows_materialize_projection = True  # Whether to materialize the historical data of a new projection
candidate_strategy = compress_ratio  # compress_ratio: sort candidates by projection compression ratio; sql: sort by the number of SQL statements that each projection can match
Configuration file in the local table scenario:
database = ssb  # Target database
work_path = ./clickhouse  # Path for storing working files such as materialized view SQL files, view files, and model files
ck_host = 0.0.0.0  # ClickHouse service IP address
ck_port = 9000  # ClickHouse service port
projection_wait_time = 3600  # Timeout interval for creating a projection
projection_compress_ratio = 0.8  # Compression ratio threshold. Projections whose data compression ratio is greater than the threshold are not recommended.
is_distributed_table = False  # Type of the ClickHouse table. False indicates a local table; True indicates a distributed table.
clickhouse_tables = lineorder_flat  # Target table. Use commas (,) to separate multiple tables. The tables must be in the target database.
query_log_table = system.query_log  # Log table from which SQL statements are extracted
sql_running_time_threshold = 100  # SQL execution time threshold, in ms. SQL statements whose execution time is less than the threshold are not processed.
query_limit_each_table = 15  # Maximum number of SQL statements that can be analyzed for each target table
projection_limit_each_table = 3  # Number of projections created for each table
projection_with_conditional_functions = False  # Whether to recommend projections for aggregate functions that contain conditions
allows_materialize_projection = True  # Whether to materialize the historical data of a new projection
candidate_strategy = compress_ratio  # compress_ratio: sort candidates by projection compression ratio; sql: sort by the number of SQL statements that each projection can match
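The greedy selection described above can be sketched as follows, under the assumption that each candidate projection carries a measured compression ratio and the number of logged SQL statements it can serve. Candidates above projection_compress_ratio are discarded; the rest are ordered according to candidate_strategy, and the top projection_limit_each_table are kept. This is an illustration of the policy, not OmniMV's actual code.

```python
# Minimal greedy candidate selection: filter by compression-ratio threshold,
# sort by the configured strategy, keep the top `limit` candidates.
def pick_projections(candidates, compress_ratio=0.8,
                     strategy="compress_ratio", limit=3):
    viable = [c for c in candidates if c["ratio"] <= compress_ratio]
    if strategy == "compress_ratio":
        viable.sort(key=lambda c: c["ratio"])  # smaller ratio is better
    else:  # "sql": prefer projections that match more logged statements
        viable.sort(key=lambda c: c["matched_sqls"], reverse=True)
    return viable[:limit]

candidates = [
    {"name": "p1", "ratio": 0.30, "matched_sqls": 5},
    {"name": "p2", "ratio": 0.95, "matched_sqls": 9},  # dropped: ratio too high
    {"name": "p3", "ratio": 0.55, "matched_sqls": 7},
]
print([c["name"] for c in pick_projections(candidates)])  # -> ['p1', 'p3']
```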
Procedure:
- Access the installation directory of OmniMV.
- Initialize the database information.
python main.pyc clickhouse init_database
- Extract SQL statements that contain the target table query from query_log.
python main.pyc clickhouse extract_sqls
- Parse the extracted SQL statements to obtain the recommended projection.
python main.pyc clickhouse generate_views
- Create the recommended projections in the target table.
python main.pyc clickhouse create_projections
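For each recommendation, the create_projections step ultimately issues projection DDL against the target table. ADD PROJECTION and MATERIALIZE PROJECTION are standard ClickHouse statements; the table, projection name, and column list below are placeholders for illustration, not actual OmniMV output.

```python
# Hypothetical illustration of the DDL behind the create_projections step.
def projection_ddl(table: str, name: str, select_body: str):
    add = f"ALTER TABLE {table} ADD PROJECTION {name} ({select_body})"
    # With allows_materialize_projection = True, historical parts are also rebuilt:
    materialize = f"ALTER TABLE {table} MATERIALIZE PROJECTION {name}"
    return add, materialize

add, mat = projection_ddl(
    "lineorder_flat", "proj_rev_by_date",
    "SELECT LO_ORDERDATE, sum(LO_REVENUE) GROUP BY LO_ORDERDATE",
)
print(add)
print(mat)
```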
Scenario 3: Obtain SQL statements from the ClickHouse execution log and parse them to obtain recommended projections. The recommendation algorithm uses an AI model, and model training consumes a large amount of resources on the ClickHouse server.
Configuration file in the distributed table scenario:
database = ssb
work_path = ./clickhouse
ck_host = 0.0.0.0
ck_port = 19000
cluster_name = mycluster
num_round = 5
projection_wait_time = 3600
projection_compress_ratio = 0.8
sql_running_time_threshold = 100
projection_limit_each_table = 5
is_distributed_table = True
clickhouse_tables = lineorder_flat_all
query_log_table = system.query_log
query_limit_each_table = 15
projection_with_conditional_functions = False
allows_materialize_projection = True
max_data_count = 20
epochs = 100
estimators = 50
max_depth = 5
learning_rate = 0.05
candidate_strategy = compress_ratio
Configuration file in the local table scenario:
database = ssb
work_path = ./clickhouse
ck_host = 0.0.0.0
ck_port = 19000
num_round = 5
projection_wait_time = 3600
projection_compress_ratio = 0.8
sql_running_time_threshold = 100
projection_limit_each_table = 5
is_distributed_table = False
clickhouse_tables = lineorder_flat
query_log_table = system.query_log
query_limit_each_table = 15
projection_with_conditional_functions = False
allows_materialize_projection = True
max_data_count = 20
epochs = 100
estimators = 50
max_depth = 5
learning_rate = 0.05
candidate_strategy = compress_ratio
Procedure:
- Access the installation directory of OmniMV.
- Initialize the database information.
python main.pyc clickhouse init_database
- Extract SQL statements that contain the target table query from query_log.
python main.pyc clickhouse extract_sqls
- Parse the extracted SQL statements to obtain the recommended projection.
python main.pyc clickhouse generate_views
- Train the cost estimation model.
python main.pyc clickhouse cost_estimation
- Train the recommendation model.
python main.pyc clickhouse recommend
- Create the recommended projections in the target table.
python main.pyc clickhouse create_projections_model
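The epochs and learning_rate keys in the Scenario 3 configuration are typical training hyperparameters for the cost estimation model. OmniMV's actual model is internal; as a purely illustrative stand-in, the sketch below shows how such hyperparameters steer a simple gradient-descent fit of query cost against one feature.

```python
# Purely illustrative: a one-feature least-squares fit driven by the epochs
# and learning_rate values from the configuration above. This is NOT OmniMV's
# cost model; it only shows the role these hyperparameters usually play.
def fit_cost_model(xs, ys, epochs=100, learning_rate=0.05):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Toy data: cost is roughly 2*x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = fit_cost_model(xs, ys)
print(round(w, 1), round(b, 1))  # -> 2.0 1.0
```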
Scenario 4: Obtain new SQL statements from the ClickHouse execution log, parse them to obtain projections, and use the trained AI model to recommend projections. Parsing and recommendation do not consume resources on the ClickHouse server.
Configuration file in the distributed table scenario:
database = ssb
work_path = ./clickhouse
ck_host = 0.0.0.0
ck_port = 19000
cluster_name = mycluster
projection_wait_time = 3600
sql_running_time_threshold = 100
projection_limit_each_table = 5
is_distributed_table = True
clickhouse_tables = lineorder_flat_all
query_log_table = system.query_log
query_limit_each_table = 15
projection_with_conditional_functions = False
allows_materialize_projection = True
candidate_strategy = compress_ratio
Configuration file in the local table scenario:
database = ssb
work_path = ./clickhouse
ck_host = 0.0.0.0
ck_port = 19000
projection_wait_time = 3600
sql_running_time_threshold = 100
projection_limit_each_table = 5
is_distributed_table = False
clickhouse_tables = lineorder_flat
query_log_table = system.query_log
query_limit_each_table = 15
projection_with_conditional_functions = False
allows_materialize_projection = True
candidate_strategy = compress_ratio
Procedure:
- Access the installation directory of OmniMV.
- Extract new SQL statements.
python main.pyc clickhouse extract_sqls_incremental
- Parse the new SQL statements to obtain projections.
python main.pyc clickhouse parse_sql_text_incremental
- Use the trained model for direct recommendation.
python main.pyc clickhouse recommend
- Optional: Use the new SQL statements to incrementally train the AI model. This consumes a large amount of resources on the ClickHouse server.
python main.pyc clickhouse cost_estimation_incremental
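The incremental idea behind extract_sqls_incremental is to pull only query_log entries newer than some stored watermark, so already-analyzed SQL is not reprocessed. The watermark handling sketched below is an assumption made for illustration, not OmniMV's actual mechanism; `event_time` is a real column of system.query_log.

```python
# Sketch: fetch only query_log entries newer than the last processed timestamp.
from datetime import datetime

def build_incremental_sql(last_run: datetime, threshold_ms: int = 100) -> str:
    return (
        "SELECT query FROM system.query_log "
        "WHERE type = 'QueryFinish' "
        f"AND query_duration_ms >= {threshold_ms} "
        f"AND event_time > '{last_run:%Y-%m-%d %H:%M:%S}'"
    )

print(build_incremental_sql(datetime(2024, 1, 1)))
```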
Scenario 5: For multi-table query SQL statements, materialized views are obtained directly after parsing. No recommendation policy is applied.
Configuration file in the multi-table scenario:
database = imdbload
work_path = ./clickhouse
ck_host = 0.0.0.0
ck_port = 19000
clickhouse_tables = all
query_log_table = system.query_log
query_limit_mutil_table = 15
Procedure:
- Access the installation directory of OmniMV.
- Initialize the database information.
python main.pyc clickhouse init_database
- Obtain the multi-table query SQL statements and generate materialized views after parsing.
python main.pyc clickhouse generate_materialize_view
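Scenario 5 only targets queries that reference more than one table. A simple heuristic for spotting such statements is to count the distinct names that follow FROM/JOIN keywords; OmniMV uses real SQL parsing, so the regex below is only a sketch, and the table names in the examples are placeholders.

```python
# Illustrative heuristic: a query is "multi-table" when more than one distinct
# table name appears after FROM or JOIN. Real SQL parsing is more robust.
import re

def is_multi_table(sql: str) -> bool:
    tables = re.findall(r"\b(?:FROM|JOIN)\s+([\w.]+)", sql, flags=re.IGNORECASE)
    return len(set(tables)) > 1

print(is_multi_table(
    "SELECT * FROM title t JOIN cast_info ci ON t.id = ci.movie_id"))  # -> True
print(is_multi_table("SELECT count() FROM lineorder_flat"))            # -> False
```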