ClickHouse Engine Configuration File
Before running ClickHouse, you need to modify the BoostKit-omnimv_1.1.0/config/omnimv_config_clickhouse.cfg file. An example of the configuration file is as follows:
[raw_file] # Database name database = default work_path = ./clickhouse [clickhouse] ck_host = xx.xx.xx.xx ck_port = 19000 # Cluster name (This parameter is optional when only local tables are queried.) cluster_name = mycluster num_round = 5 # Indicates whether to enable ClickHouse projections. Possible values are True and False. enable_projection = False # Timeout interval for creating and deleting a projection, in seconds. projection_wait_time = 3600 # Data compression ratio = Number of data rows generated by a view/Number of rows to be processed by an SQL query read_rows. A smaller data compression ratio indicates better performance. projection_compress_ratio = 0.8 # SQL running threshold. Views are not recommended for SQL statements whose running time is less than the threshold. The unit is ms. sql_running_time_threshold = 100 # Maximum number of recommended projections in each table projection_limit_each_table = 5 # Working mode for configuring tables during projection recommendation. Value True indicates that only distributed tables can be configured for clickhouse_tables; False indicates that only local tables can be configured for clickhouse_tables. is_distributed_table = False # Lists key tables for recommendation. Use commas (,) to separate the tables. In the multi-table scenario, you need to set it to all, that is, clickhouse_tables = all. clickhouse_tables = lineorder_flat # Specifies the system table from which query information is obtained. Generally, for local tables, information is queried from system.query_log, and for distributed tables, information is from other tables. query_log_table = system.query_log # (Initial extraction of query information) Number of queries that are extracted from a specified database and data table and whose running time exceeds the threshold in the configuration file. Query information is extracted from each table in query_log. # (Incremental extraction of query information) Number of queries to be extracted incrementally query_limit_each_table = 15 # Total number of queries to be extracted in the multi-table scenario query_limit_mutil_table = 15 # Indicates whether to recommend projections for the aggregate function SQL that contains conditions (for example, select sum(multiif..)). Possible values are True and False. # If the value is True, projections are recommended after the conditional expression is removed. If the value is False, projections are not recommended for the SQL statement. projection_with_conditional_functions = False # Indicates whether to materialize historical data for a new projection. Possible values are True and False. By default, historical data is not materialized. allows_materialize_projection = False # Use table_name.projection_name to specify the name of the projection to delete. table_name can only be a local table. Use commas (,) to separate multiple projections. # If the projection specified by table_name.projection_name exists on all nodes and needs to be deleted from all the nodes, set [is_distributed_table] to True. # If you do not need to delete any projection, set it to None. drop_projections = None [train] # Arranges and combines aggregation columns to increase the amount of training data. You can limit the maximum number of training SQL statements to be expanded. max_data_count = 20 epochs = 100 estimators = 50 max_depth = 5 learning_rate = 0.05 [recommend] # recommend strategy # options: ['sql', 'compress_ratio'] # Value compress_ratio indicates that data is sorted based on the projection compression ratio; SQL indicates that data is sorted based on the number of SQL statements that can be matched by projections. candidate_strategy = compress_ratio
Parameter |
Description |
Default Value |
|---|---|---|
database |
Name of the database used for training |
- |
work_path |
Working path, which stores information about SQL, materialized views, and the model |
- |
ck_host |
IP address of the ClickHouse client |
xx.xx.xx.xx |
ck_port |
Port number of the ClickHouse client. |
- |
cluster_name |
Cluster name. This parameter is optional when only local tables are queried. |
- |
enable_projection |
Indicates whether to enable ClickHouse projections. Possible values are True and False. |
False |
projection_wait_time |
Timeout interval for creating and deleting a projection, in seconds. |
3600 |
projection_compress_ratio |
Threshold of the data compression ratio. If the ratio is greater than the threshold, the corresponding projection is not retained. Data compression ratio = Number of data rows generated by the view/Number of rows to be processed by the SQL query read_rows. A smaller data compression ratio indicates better performance. |
0.8 |
sql_running_time_threshold |
SQL running threshold. Views are not recommended for SQL statements whose running time is less than the threshold. The unit is ms. |
100 |
projection_limit_each_table |
Maximum number of recommended projections in each table |
5 |
is_distributed_table |
Working mode for configuring tables during projection recommendation. Value True indicates that only distributed tables can be configured for clickhouse_tables; False indicates that only local tables can be configured for clickhouse_tables. |
False |
clickhouse_tables |
Lists key tables for recommendation. Use commas (,) to separate tables. In the multi-table scenario, you need to set it to all, that is, clickhouse_tables = all. |
- |
query_log_table |
Specifies the system table from which query information is obtained. Generally, for local tables, information is queried from system.query_log, and for distributed tables, information is from other tables. |
system.query_log |
query_limit_each_table |
|
15 |
query_limit_mutil_table |
Configures the total number of queries to be extracted in the multi-table scenario. |
15 |
projection_with_conditional_functions |
Indicates whether to recommend projections for the aggregate function SQL that contains conditions (for example, select sum(multiif..)). Possible values are True and False. If the value is True, projections are recommended after the conditional expression is removed. If the value is False, projections are not recommended for the SQL statement. |
False |
allows_materialize_projection |
Indicates whether to materialize historical data for a new projection. Possible values are True and False. By default, historical data is not materialized. |
False |
drop_projections |
Deletes the projection specified by table_name.projection_name. table_name must be a local table. Use commas (,) to separate multiple projections. If the projection specified by table_name.projection_name exists on all nodes and needs to be deleted from all the nodes, set [is_distributed_table] to True. If you do not want to delete any projection, set it to None. |
None |
max_data_count |
Arranges and combines aggregation columns to increase the amount of training data. You can limit the maximum number of training SQL statements to be expanded. |
20 |
epochs |
Number of model training epochs |
200 |
learning_rate |
Learning rate of model training |
0.005 |
estimators |
Number of gradient boosting trees. This is a hyperparameter of the model. |
50 |
max_depth |
Depth of gradient boosting trees. This is a hyperparameter of the model. |
5 |
candidate_strategy |
Value compress_ratio indicates that data is sorted based on the projection compression ratio; SQL indicates that data is sorted based on the number of SQL statements that can be matched by projections. Options: ['sql', 'compress_ratio'] |
compress_ratio |