OmniCache Configuration File
Before running OmniCache, modify the BoostKit-omnicache_1.0.0/config/omnicache_config.cfg file. The following is an example of the configuration file:
```ini
[schema]
# Schema path
schema_path = hdfs://server1:9000/omnicache_0919/schema.sql

[raw_file]
# Database name
database = tpcds_bin_partitioned_decimal_orc_8000
# Specifies the time segment for executing the query in Spark yarn mode.
q_log_start_time = 2022-09-16 20:28
q_log_end_time = 2022-09-17 08:50
# Specifies the time segment for executing the successfully rewritten query using the OmniCache plugin in Spark yarn mode.
q_mv_log_start_time = 2022-09-17 16:37
q_mv_log_end_time = 2022-09-17 19:20
# Path of the log parser JAR package
logparser_jar_path = /opt/omnicache/boostkit-omnicache-logparser-spark-3.1.1-1.0.0-aarch64.jar
# Path of the OmniCache plugin
cache_plugin_jar_path = /opt/omnicache/boostkit-omnicache-spark-3.1.1-1.0.0-aarch64.jar
# Spark history output path configured in the spark-defaults.conf file
hdfs_input_path = hdfs://server1:9000/spark2-history
# Path for storing all logs after log parsing. Each log is parsed into JSON format.
hdfs_output_path = hdfs://server1:9000/omnicache_0919/spark2-history-json
# Path for storing SQL statements. The following path is an example.
sqls_path = hdfs://server1:9000/omnicache_0919/tpcds

[csv]
# The following three files are intermediate result files of the recommend task, that is, the results of using the cost estimation model.
query_csv = hdfs://server1:9000/omnicache_0919/data/query_data.csv
mv_csv = hdfs://server1:9000/omnicache_0919/data/mv_data.csv
q_mv_csv = hdfs://server1:9000/omnicache_0919/data/query_mv_data.csv
# Path for storing candidate views. Subdirectories are generated in this directory: mv_sql (all candidate views), top_mv (top N candidate views), and mv_recommend (the recommended view).
mv_output_path = hdfs://server1:9000/omnicache_0919/mv
# Top N query-view mappings
q_mv_mapping_csv = hdfs://server1:9000/omnicache_0919/data/query-mv_mapping.csv
# Output result file of the recommend task, including the ID and cost of the recommended materialized view. You can find the original SQL statements of the materialized view in ${mv_output_path}/mv_sql based on the ID.
recommend_mv_csv = hdfs://server1:9000/omnicache_0919/data/recommend-mvdata.csv

[train]
# Ratio of training data. You are not advised to change the value.
train_rate = 0.8
# Training batch size. Adjust the value based on the value of N in top N. The batch size is an integer power of 2, N/batch_size should be about 10, and batch_size cannot exceed 256.
batch_size = 8
# Number of training epochs. Adjust the value based on the value of N in top N. The value usually ranges from 30 to 100. When N is 20, set epoch to 30; increase epoch as N grows.
epoch = 20
# Learning rate. You are not advised to change the value.
learning_rate = 0.005
# The following are deep learning network parameters. You are not advised to adjust them.
weight_decay = 1e-5
keyword_embedding_size = 32
char_embedding_size = 32
node_auxiliary_size = 2
first_hidden_size = 32
second_hidden_size = 32
drop_rate = 0.1
# Paths for saving the cost estimation models. Delete these files when switching the database platform for retraining.
model_save_path_q = hdfs://server1:9000/omnicache_0919/training/ckpt_q.pt
model_save_path_q_mv = hdfs://server1:9000/omnicache_0919/training/ckpt_q_mv.pt
# Path for storing the output data during model training
training_data_output_path = hdfs://server1:9000/omnicache_0919/training

[rl]
# Batch size for recommendation model training
batch_size_rl = 2
# Learning rate during recommendation model training
learning_rate_rl = 0.01
# The following are reinforcement learning parameters. You are not advised to adjust them.
epsilon = 0.9
gamma = 0.9
target_replace_iter = 3
memory_capacity = 10
drop_rate_rl = 0.2

[spark]
# Spark execution mode. The value can be yarn or local.
spark_master = yarn
# Value of the --name Spark parameter
app_name = omnicache
# Value of the --num-executors Spark parameter
spark_num_executors = 30
# Value of the --executor-memory Spark parameter
spark_executor_memory = 32g
# Value of the --driver-memory Spark parameter
spark_driver_memory = 48g
# Value of the --executor-cores Spark parameter
executor_cores = 18
# Timeout period for the spark-sql session, in seconds. The default value is 5000s.
session_timeout = 5000

[mv_limit]
# Maximum number of top candidate materialized views
cnt_limit = 40
space_limit = 1024
# A subtree whose reference count is less than refer_threshold is not used as a candidate view. The default value is 2.
refer_threshold = 2

[mv_selection]
query_ratio = 0.6
space_ratio = 0.35
filter_ratio = 0.05
```
`hdfs://server1:9000` indicates that data is cached in HDFS. `server1` is the host name of the current server, and `9000` is the HDFS port number. You can adjust these values based on your environment, or use a simplified form. For example:
Change `sqls_path = hdfs://server1:9000/omnicache_0919/tpcds` to `sqls_path = /omnicache_0919/tpcds`.
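As a quick sanity check on this URI structure, Python's standard `urllib.parse` splits such an HDFS URI into exactly the host, port, and path components described above (the URI value is the example from the configuration file):

```python
from urllib.parse import urlparse

# Split an HDFS URI of the form used in the configuration file into its parts.
uri = "hdfs://server1:9000/omnicache_0919/tpcds"
parsed = urlparse(uri)

print(parsed.hostname)  # server1 (host name of the current server)
print(parsed.port)      # 9000 (HDFS port number)
print(parsed.path)      # /omnicache_0919/tpcds (the simplified form of the path)
```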
In the configuration file, all paths are HDFS paths by default, except the paths of the two JAR packages (`logparser_jar_path` and `cache_plugin_jar_path`), which are local file system paths.
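The file follows standard INI syntax (sections in brackets, `key = value` pairs, `#` comments), so it can be read with Python's standard `configparser`. The sketch below uses an inline excerpt of the example file above for illustration; whether OmniCache itself parses the file this way is not documented here:

```python
import configparser

# Minimal sketch: parse an excerpt of omnicache_config.cfg.
# In practice you would call config.read("config/omnicache_config.cfg") instead.
config = configparser.ConfigParser()
config.read_string("""
[raw_file]
# Database name
database = tpcds_bin_partitioned_decimal_orc_8000
sqls_path = hdfs://server1:9000/omnicache_0919/tpcds
[train]
batch_size = 8
train_rate = 0.8
""")

database = config.get("raw_file", "database")
batch_size = config.getint("train", "batch_size")   # typed accessors validate values
train_rate = config.getfloat("train", "train_rate")
```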
| Item | Description | Default Value |
|---|---|---|
| schema_path | Path for storing the table structure | - |
| database | Name of the database used for training | - |
| q_log_start_time | Start time of the original SQL queries | - |
| q_log_end_time | End time of the original SQL queries | - |
| q_mv_log_start_time | Start time of the SQL queries rewritten using materialized views | - |
| q_mv_log_end_time | End time of the SQL queries rewritten using materialized views | - |
| logparser_jar_path | Path of the log parser JAR package | - |
| cache_plugin_jar_path | Path of the OmniCache plugin JAR package | - |
| hdfs_input_path | Path for storing historical Spark logs | - |
| hdfs_output_path | Path for storing logs parsed by OmniCache | - |
| sqls_path | Path for storing the SQL set to be trained | - |
| query_csv | Path for storing query data | - |
| mv_csv | Path for storing view data | - |
| q_mv_csv | Path for storing query-view mappings | - |
| mv_output_path | Output path of candidate views | - |
| q_mv_mapping_csv | Path for storing the top N query-view mappings | - |
| recommend_mv_csv | Output path of the recommended view | - |
| train_rate | Ratio of training data. You are not advised to change the value. | 0.8 |
| batch_size | Training batch size. Adjust the value based on the value of N in top N: the batch size is an integer power of 2, N/batch_size should be about 10, and batch_size cannot exceed 256. | 8 |
| epoch | Number of training epochs. Adjust the value based on the value of N in top N. The value usually ranges from 30 to 100. When N is 20, set epoch to 30; increase epoch for larger N. | 20 |
| learning_rate | Learning rate. You are not advised to change the value. | 0.005 |
| weight_decay | Deep learning network parameter. You are not advised to change the value. | 1e-5 |
| keyword_embedding_size | Deep learning network parameter. You are not advised to change the value. | 32 |
| char_embedding_size | Deep learning network parameter. You are not advised to change the value. | 32 |
| node_auxiliary_size | Deep learning network parameter. You are not advised to change the value. | 2 |
| first_hidden_size | Deep learning network parameter. You are not advised to change the value. | 32 |
| second_hidden_size | Deep learning network parameter. You are not advised to change the value. | 32 |
| drop_rate | Deep learning network parameter. You are not advised to change the value. | 0.1 |
| model_save_path_q | Path for saving the cost estimation model. Delete this file when switching the database platform for retraining. | - |
| model_save_path_q_mv | Path for saving the cost estimation model. Delete this file when switching the database platform for retraining. | - |
| training_data_output_path | Path for storing the output data during model training | - |
| batch_size_rl | Batch size for recommendation model training | 2 |
| learning_rate_rl | Learning rate for recommendation model training | 0.01 |
| epsilon | Reinforcement learning parameter. You are not advised to change the value. | 0.9 |
| gamma | Reinforcement learning parameter. You are not advised to change the value. | 0.9 |
| target_replace_iter | Reinforcement learning parameter. You are not advised to change the value. | 3 |
| memory_capacity | Reinforcement learning parameter. You are not advised to change the value. | 10 |
| drop_rate_rl | Reinforcement learning parameter. You are not advised to change the value. | 0.2 |
| spark_master | Spark execution mode. The value can be yarn or local. | yarn |
| app_name | Spark task name (the --name Spark parameter) | - |
| spark_num_executors | Spark --num-executors parameter | 30 |
| spark_executor_memory | Spark --executor-memory parameter | 32g |
| spark_driver_memory | Spark --driver-memory parameter | 48g |
| executor_cores | Spark --executor-cores parameter | 18 |
| session_timeout | Timeout period of a spark-sql session, in seconds | 5000 |
| cnt_limit | Maximum number of candidate views. You can increase the value to cover more SQL statements. | 40 |
| refer_threshold | A subtree whose reference count is less than refer_threshold is not used as a candidate view. You are not advised to change the value. | 2 |
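The batch_size tuning rule above (an integer power of 2, N/batch_size about 10, capped at 256) can be sketched as a small helper. The function name is hypothetical and only encodes the guidance stated in the table, not anything OmniCache ships:

```python
def suggest_batch_size(n: int) -> int:
    """Suggest a [train] batch_size for top-N candidate views, per the table's
    guidance: pick the power of two closest to n/10, never exceeding 256."""
    target = max(n / 10, 1)
    size = 1
    # Keep doubling while the next power of two is closer to the target
    # and still within the documented cap of 256.
    while size * 2 <= 256 and abs(size * 2 - target) < abs(size - target):
        size *= 2
    return size

print(suggest_batch_size(80))    # -> 8 (matches the example default: N=80, 80/8 = 10)
print(suggest_batch_size(2000))  # -> 256 (cap applies for very large N)
```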