Rate This Document
Findability
Accuracy
Completeness
Readability

ClickHouse Engine Configuration File

Before running ClickHouse, you need to modify the BoostKit-omnimv_1.1.0/config/omnimv_config_clickhouse.cfg file. An example of the configuration file is as follows:

[raw_file]
# Database name
database = default
work_path = ./clickhouse
[clickhouse]
ck_host = xx.xx.xx.xx
ck_port = 19000
# Cluster name (This parameter is optional when only local tables are queried.)
cluster_name = mycluster
num_round = 5
# Indicates whether to enable ClickHouse projections. Possible values are True and False.
enable_projection = False
# Timeout interval for creating and deleting a projection, in seconds.
projection_wait_time = 3600
# Data compression ratio = Number of data rows generated by a view/Number of rows to be processed by an SQL query read_rows. A smaller data compression ratio indicates better performance.
projection_compress_ratio = 0.8
# SQL running threshold. Views are not recommended for SQL statements whose running time is less than the threshold. The unit is ms.
sql_running_time_threshold = 100
# Maximum number of recommended projections in each table
projection_limit_each_table = 5
# Working mode for configuring tables during projection recommendation. Value True indicates that only distributed tables can be configured for clickhouse_tables; False indicates that only local tables can be configured for clickhouse_tables.
is_distributed_table = False
# Lists key tables for recommendation. Use commas (,) to separate the tables. In the multi-table scenario, you need to set it to all, that is, clickhouse_tables = all.
clickhouse_tables = lineorder_flat
# Specifies the system table from which query information is obtained. Generally, for local tables, information is queried from system.query_log, and for distributed tables, information is from other tables.
query_log_table = system.query_log
# (Initial extraction of query information) Number of queries that are extracted from a specified database and data table and whose running time exceeds the threshold in the configuration file. Query information is extracted from each table in query_log.
# (Incremental extraction of query information) Number of queries to be extracted incrementally
query_limit_each_table = 15
# Total number of queries to be extracted in the multi-table scenario
query_limit_mutil_table = 15
# Indicates whether to recommend projections for the aggregate function SQL that contains conditions (for example, select sum(multiif..)). Possible values are True and False.
# If the value is True, projections are recommended after the conditional expression is removed. If the value is False, projections are not recommended for the SQL statement.
projection_with_conditional_functions = False
# Indicates whether to materialize historical data for a new projection. Possible values are True and False. By default, historical data is not materialized.
allows_materialize_projection = False
# Use table_name.projection_name to specify the name of the projection to delete. table_name can only be a local table. Use commas (,) to separate multiple projections.
# If the projection specified by table_name.projection_name exists on all nodes and needs to be deleted from all the nodes, set [is_distributed_table] to True.
# If you do not need to delete any projection, set it to None.
drop_projections = None
[train]
# Arranges and combines aggregation columns to increase the amount of training data. You can limit the maximum number of training SQL statements to be expanded.
max_data_count = 20
epochs = 100
estimators = 50
max_depth = 5
learning_rate = 0.05
[recommend]
# recommend strategy
# options: ['sql', 'compress_ratio']
# Value compress_ratio indicates that data is sorted based on the projection compression ratio; SQL indicates that data is sorted based on the number of SQL statements that can be matched by projections.
candidate_strategy = compress_ratio
Table 1 ClickHouse engine configuration file

Parameter

Description

Default Value

database

Name of the database used for training

-

work_path

Working path, which stores information about SQL, materialized views, and the model

-

ck_host

IP address of the ClickHouse client

xx.xx.xx.xx

ck_port

Port number of the ClickHouse client.

-

cluster_name

Cluster name. This parameter is optional when only local tables are queried.

-

enable_projection

Indicates whether to enable ClickHouse projections. Possible values are True and False.

False

projection_wait_time

Timeout interval for creating and deleting a projection, in seconds.

3600

projection_compress_ratio

Threshold of the data compression ratio. If the ratio is greater than the threshold, the corresponding projection is not retained. Data compression ratio = Number of data rows generated by the view/Number of rows to be processed by the SQL query read_rows. A smaller data compression ratio indicates better performance.

0.8

sql_running_time_threshold

SQL running threshold. Views are not recommended for SQL statements whose running time is less than the threshold. The unit is ms.

100

projection_limit_each_table

Maximum number of recommended projections in each table

5

is_distributed_table

Working mode for configuring tables during projection recommendation. Value True indicates that only distributed tables can be configured for clickhouse_tables; False indicates that only local tables can be configured for clickhouse_tables.

False

clickhouse_tables

Lists key tables for recommendation. Use commas (,) to separate tables. In the multi-table scenario, you need to set it to all, that is, clickhouse_tables = all.

-

query_log_table

Specifies the system table from which query information is obtained. Generally, for local tables, information is queried from system.query_log, and for distributed tables, information is from other tables.

system.query_log

query_limit_each_table

  • (Initial extraction of query information) Number of queries that are extracted from a specified database and data table and whose running time exceeds the threshold in the configuration file. Query information is extracted from each table in query_log.
  • (Incremental extraction of query information) Number of queries to be extracted incrementally

15

query_limit_mutil_table

Configures the total number of queries to be extracted in the multi-table scenario.

15

projection_with_conditional_functions

Indicates whether to recommend projections for the aggregate function SQL that contains conditions (for example, select sum(multiif..)). Possible values are True and False.

If the value is True, projections are recommended after the conditional expression is removed. If the value is False, projections are not recommended for the SQL statement.

False

allows_materialize_projection

Indicates whether to materialize historical data for a new projection. Possible values are True and False. By default, historical data is not materialized.

False

drop_projections

Deletes the projection specified by table_name.projection_name. table_name must be a local table. Use commas (,) to separate multiple projections. If the projection specified by table_name.projection_name exists on all nodes and needs to be deleted from all the nodes, set [is_distributed_table] to True. If you do not want to delete any projection, set it to None.

None

max_data_count

Arranges and combines aggregation columns to increase the amount of training data. You can limit the maximum number of training SQL statements to be expanded.

20

epochs

Number of model training epochs

200

learning_rate

Learning rate of model training

0.005

estimators

Number of gradient boosting trees. This is a hyperparameter of the model.

50

max_depth

Depth of gradient boosting trees. This is a hyperparameter of the model.

5

candidate_strategy

Value compress_ratio indicates that data is sorted based on the projection compression ratio; SQL indicates that data is sorted based on the number of SQL statements that can be matched by projections. Options: ['sql', 'compress_ratio']

compress_ratio