ClickHouse Engine Configuration File

Before running ClickHouse, you need to modify the BoostKit-omnimv_1.1.0/config/omnimv_config_clickhouse.cfg file. An example of the configuration file is as follows:

[raw_file]
# Database name
database = default
work_path = ./clickhouse
[clickhouse]
ck_host = xx.xx.xx.xx
ck_port = 19000
# Cluster name (This parameter is optional when only local tables are queried.)
cluster_name = mycluster
num_round = 5
# Indicates whether to enable ClickHouse projections. Possible values are True and False.
enable_projection = False
# Timeout interval for creating and deleting a projection, in seconds.
projection_wait_time = 3600
# Data compression ratio = Number of data rows generated by a view/Number of rows to be processed by an SQL query read_rows. A smaller data compression ratio indicates better performance.
projection_compress_ratio = 0.8
# SQL running threshold. Views are not recommended for SQL statements whose running time is less than the threshold. The unit is ms.
sql_running_time_threshold = 100
# Maximum number of recommended projections in each table
projection_limit_each_table = 5
# Working mode for configuring tables during projection recommendation. Value True indicates that only distributed tables can be configured for clickhouse_tables; False indicates that only local tables can be configured for clickhouse_tables.
is_distributed_table = False
# Lists key tables for recommendation. Use commas (,) to separate the tables. In the multi-table scenario, you need to set it to all, that is, clickhouse_tables = all.
clickhouse_tables = lineorder_flat
# Specifies the system table from which query information is obtained. Generally, for local tables, information is queried from system.query_log, and for distributed tables, information is from other tables.
query_log_table = system.query_log
# (Initial extraction of query information) Number of queries that are extracted from a specified database and data table and whose running time exceeds the threshold in the configuration file. Query information is extracted from each table in query_log.
# (Incremental extraction of query information) Number of queries to be extracted incrementally
query_limit_each_table = 15
# Total number of queries to be extracted in the multi-table scenario
query_limit_mutil_table = 15
# Indicates whether to recommend projections for the aggregate function SQL that contains conditions (for example, select sum(multiif..)). Possible values are True and False.
# If the value is True, projections are recommended after the conditional expression is removed. If the value is False, projections are not recommended for the SQL statement.
projection_with_conditional_functions = False
# Indicates whether to materialize historical data for a new projection. Possible values are True and False. By default, historical data is not materialized.
allows_materialize_projection = False
# Use table_name.projection_name to specify the name of the projection to delete. table_name can only be a local table. Use commas (,) to separate multiple projections.
# If the projection specified by table_name.projection_name exists on all nodes and needs to be deleted from all the nodes, set [is_distributed_table] to True.
# If you do not need to delete any projection, set it to None.
drop_projections = None
[train]
# Arranges and combines aggregation columns to increase the amount of training data. You can limit the maximum number of training SQL statements to be expanded.
max_data_count = 20
epochs = 100
estimators = 50
max_depth = 5
learning_rate = 0.05
[recommend]
# recommend strategy
# options: ['sql', 'compress_ratio']
# Value compress_ratio indicates that data is sorted based on the projection compression ratio; SQL indicates that data is sorted based on the number of SQL statements that can be matched by projections.
candidate_strategy = compress_ratio

**Table 1** ClickHouse engine configuration file
Parameter	Description	Default Value
database	Name of the database used for training	-
work_path	Working path, which stores information about SQL, materialized views, and the model	-
ck_host	IP address of the ClickHouse client	xx.xx.xx.xx
ck_port	Port number of the ClickHouse client.	-
cluster_name	Cluster name. This parameter is optional when only local tables are queried.	-
enable_projection	Indicates whether to enable ClickHouse projections. Possible values are True and False.	False
projection_wait_time	Timeout interval for creating and deleting a projection, in seconds.	3600
projection_compress_ratio	Threshold of the data compression ratio. If the ratio is greater than the threshold, the corresponding projection is not retained. Data compression ratio = Number of data rows generated by the view/Number of rows to be processed by the SQL query read_rows. A smaller data compression ratio indicates better performance.	0.8
sql_running_time_threshold	SQL running threshold. Views are not recommended for SQL statements whose running time is less than the threshold. The unit is ms.	100
projection_limit_each_table	Maximum number of recommended projections in each table	5
is_distributed_table	Working mode for configuring tables during projection recommendation. Value True indicates that only distributed tables can be configured for clickhouse_tables; False indicates that only local tables can be configured for clickhouse_tables.	False
clickhouse_tables	Lists key tables for recommendation. Use commas (,) to separate tables. In the multi-table scenario, you need to set it to all, that is, clickhouse_tables = all.	-
query_log_table	Specifies the system table from which query information is obtained. Generally, for local tables, information is queried from system.query_log, and for distributed tables, information is from other tables.	system.query_log
query_limit_each_table	(Initial extraction of query information) Number of queries that are extracted from a specified database and data table and whose running time exceeds the threshold in the configuration file. Query information is extracted from each table in query_log. (Incremental extraction of query information) Number of queries to be extracted incrementally	15
query_limit_mutil_table	Configures the total number of queries to be extracted in the multi-table scenario.	15
projection_with_conditional_functions	Indicates whether to recommend projections for the aggregate function SQL that contains conditions (for example, select sum(multiif..)). Possible values are True and False. If the value is True, projections are recommended after the conditional expression is removed. If the value is False, projections are not recommended for the SQL statement.	False
allows_materialize_projection	Indicates whether to materialize historical data for a new projection. Possible values are True and False. By default, historical data is not materialized.	False
drop_projections	Deletes the projection specified by table_name.projection_name. table_name must be a local table. Use commas (,) to separate multiple projections. If the projection specified by table_name.projection_name exists on all nodes and needs to be deleted from all the nodes, set [is_distributed_table] to True. If you do not want to delete any projection, set it to None.	None
max_data_count	Arranges and combines aggregation columns to increase the amount of training data. You can limit the maximum number of training SQL statements to be expanded.	20
epochs	Number of model training epochs	200
learning_rate	Learning rate of model training	0.005
estimators	Number of gradient boosting trees. This is a hyperparameter of the model.	50
max_depth	Depth of gradient boosting trees. This is a hyperparameter of the model.	5
candidate_strategy	Value compress_ratio indicates that data is sorted based on the projection compression ratio; SQL indicates that data is sorted based on the number of SQL statements that can be matched by projections. Options: ['sql', 'compress_ratio']	compress_ratio

Parent topic: Materialized View Configuration File