OmniCache Configuration File

Before running OmniCache, you need to modify the BoostKit-omnicache_1.0.0/config/omnicache_config.cfg file. An example of the configuration file is as follows:

[schema]
# Schema path
schema_path = hdfs://server1:9000/omnicache_0919/schema.sql
[raw_file]
# Database name
database = tpcds_bin_partitioned_decimal_orc_8000
# Specifies the time segment for executing the query in Spark yarn mode.
q_log_start_time = 2022-09-16 20:28
q_log_end_time = 2022-09-17 08:50
# Specifies the time segment for executing the successfully rewritten query using the OmniCache plugin in Spark yarn mode.
q_mv_log_start_time = 2022-09-17 16:37
q_mv_log_end_time = 2022-09-17 19:20
# Path of the log parser JAR package
logparser_jar_path = /opt/omnicache/boostkit-omnicache-logparser-spark-3.1.1-1.0.0-aarch64.jar
# Path of the OmniCache plugin
cache_plugin_jar_path = /opt/omnicache/boostkit-omnicache-spark-3.1.1-1.0.0-aarch64.jar
# Spark history output path configured in the spark-defaults.conf file
hdfs_input_path = hdfs://server1:9000/spark2-history
# Path for storing all logs after log parsing. Each log is parsed into the JSON format.
hdfs_output_path = hdfs://server1:9000/omnicache_0919/spark2-history-json
# Path for storing SQL statements. The following path is an example.
sqls_path = hdfs://server1:9000/omnicache_0919/tpcds
[csv]
# The following three files are intermediate result files of the recommend task, that is, the results of using the cost estimation model.
query_csv = hdfs://server1:9000/omnicache_0919/data/query_data.csv
mv_csv = hdfs://server1:9000/omnicache_0919/data/mv_data.csv
q_mv_csv = hdfs://server1:9000/omnicache_0919/data/query_mv_data.csv
# Path for storing candidate views. Subdirectories are generated in the directory: mv_sql (storing all candidate views), top_mv (storing top N candidate views), and mv_recommend (storing the recommended view).
mv_output_path = hdfs://server1:9000/omnicache_0919/mv
# Top N query-view mappings
q_mv_mapping_csv = hdfs://server1:9000/omnicache_0919/data/query-mv_mapping.csv
# Output result file of the recommend task, including the ID and cost of the recommended materialized view. You can find the original SQL statements of the materialized view in ${mv_output_path}/mv_sql based on the ID.
recommend_mv_csv = hdfs://server1:9000/omnicache_0919/data/recommend-mvdata.csv
[train]
# Ratio of training data. You are not advised to change the value.
train_rate = 0.8
# Training batch size. Adjust the value based on the value of N in top N: batch_size must be an integer power of 2, N/batch_size should be about 10, and batch_size cannot exceed 256.
batch_size = 8
# Number of training epochs. Adjust the value based on the value of N in top N. The value typically ranges from 30 to 100: when N is 20, set epoch to 30; when N is larger, increase epoch accordingly.
epoch = 20
# Learning rate. You are not advised to change the value.
learning_rate = 0.005
# The following are deep learning network parameters. You are not advised to adjust them.
weight_decay = 1e-5
keyword_embedding_size = 32
char_embedding_size = 32
node_auxiliary_size = 2
first_hidden_size = 32
second_hidden_size = 32
drop_rate = 0.1
# Path for saving the cost estimation model. Delete this file when switching the database platform for retraining.
model_save_path_q = hdfs://server1:9000/omnicache_0919/training/ckpt_q.pt
model_save_path_q_mv = hdfs://server1:9000/omnicache_0919/training/ckpt_q_mv.pt
# Path for storing the output data during model training
training_data_output_path = hdfs://server1:9000/omnicache_0919/training
[rl]
# Batch size for recommendation model training
batch_size_rl = 2
# Learning rate during recommendation model training
learning_rate_rl = 0.01
# The following are reinforcement learning parameters. You are not advised to adjust them.
epsilon = 0.9
gamma = 0.9
target_replace_iter = 3
memory_capacity = 10
drop_rate_rl = 0.2
[spark]
# Spark execution mode. The value can be yarn or local.
spark_master = yarn
# Parameter following the --name Spark parameter
app_name = omnicache
# Parameter specified by the --num-executors Spark parameter
spark_num_executors = 30
# Parameter specified by the --executor-memory Spark parameter
spark_executor_memory = 32g
# Parameter specified by the --driver-memory Spark parameter
spark_driver_memory = 48g
# Parameter specified by the --executor-cores Spark parameter
executor_cores = 18
# Timeout period for the spark-sql session, in seconds. The default value is 5000s.
session_timeout = 5000
[mv_limit]
# Maximum number of top candidate materialized views
cnt_limit = 40
space_limit = 1024
# The subtree whose reference count is less than refer_threshold is not used as a candidate view. The default value is 2.
refer_threshold=2
[mv_selection]
query_ratio = 0.6
space_ratio = 0.35
filter_ratio = 0.05

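The batch_size and epoch guidance above can be expressed as a small helper. This is an illustrative sketch of the stated rules, not part of OmniCache; the function names and the linear epoch growth are assumptions.

```python
def suggest_batch_size(n_top, cap=256):
    """Pick a power-of-2 batch size so that n_top / batch_size is about 10,
    never exceeding the documented cap of 256."""
    target = max(1, n_top // 10)      # aim for roughly N / 10
    size = 1
    while size * 2 <= target:         # largest power of 2 not above the target
        size *= 2
    return min(size, cap)

def suggest_epoch(n_top):
    """Scale epochs with N: 30 when N is 20, bounded above by 100.
    The linear growth rate is an assumption; the documentation only says
    to increase epoch for larger N."""
    return max(30, min(100, 30 + (n_top - 20)))
```

For example, suggest_batch_size(80) returns 8, since 8 is the largest power of 2 not exceeding 80/10.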
In the example, hdfs://server1:9000 indicates that data is cached in HDFS: server1 is the host name of the current server, and 9000 is the HDFS port number. You can adjust these values based on your environment, or simplify the paths by omitting the scheme and authority. For example:

Change sqls_path = hdfs://server1:9000/omnicache_0919/tpcds to sqls_path = /omnicache_0919/tpcds.

In the configuration file, all addresses other than the paths of the two JAR packages are HDFS paths by default.
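Because omnicache_config.cfg follows standard INI syntax, it can be sanity-checked before a task is launched. The sketch below, assuming Python's standard configparser module, verifies that a few required options are present; the REQUIRED map and function names are illustrative, not part of OmniCache.

```python
import configparser

# A sample of required options per section, taken from the example file above.
REQUIRED = {
    "schema": ["schema_path"],
    "raw_file": ["database", "hdfs_input_path", "hdfs_output_path", "sqls_path"],
    "train": ["train_rate", "batch_size", "epoch", "learning_rate"],
    "spark": ["spark_master"],
}

def missing_options(parser):
    """List required section.key pairs absent from a parsed configuration."""
    return [f"{section}.{key}"
            for section, keys in REQUIRED.items()
            for key in keys
            if not parser.has_option(section, key)]

def check_config(path):
    """Parse an omnicache_config.cfg file and report missing required options."""
    parser = configparser.ConfigParser()
    parser.read(path)
    return missing_options(parser)
```

Running check_config("config/omnicache_config.cfg") before launching a task returns an empty list when all sampled options are present.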

Table 1 OmniCache configuration items

| Item | Description | Default Value |
| --- | --- | --- |
| schema_path | Path for storing the table structure | - |
| database | Name of the database used for training | - |
| q_log_start_time | Start time of the original SQL query | - |
| q_log_end_time | End time of the original SQL query | - |
| q_mv_log_start_time | Start time of the rewritten SQL query using a materialized view | - |
| q_mv_log_end_time | End time of the rewritten SQL query using a materialized view | - |
| logparser_jar_path | Path of the log parser JAR package | - |
| cache_plugin_jar_path | Path of the OmniCache plugin JAR package | - |
| hdfs_input_path | Path for storing historical Spark logs | - |
| hdfs_output_path | Path for storing logs parsed by OmniCache | - |
| sqls_path | SQL set to be trained | - |
| query_csv | Path for storing query data | - |
| mv_csv | Path for storing view data | - |
| q_mv_csv | Path for storing query-view mappings | - |
| mv_output_path | Output path of candidate views | - |
| q_mv_mapping_csv | Path for storing top N query-view mappings | - |
| recommend_mv_csv | Output path of the recommended view | - |
| train_rate | Ratio of training data. You are not advised to change the value. | 0.8 |
| batch_size | Training batch size. Adjust the value based on the value of N in top N: batch_size must be an integer power of 2, N/batch_size should be about 10, and batch_size cannot exceed 256. | 8 |
| epoch | Number of training epochs. Adjust the value based on the value of N in top N. The value typically ranges from 30 to 100: when N is 20, set epoch to 30; when N is larger, increase epoch accordingly. | 20 |
| learning_rate | Learning rate. You are not advised to change the value. | 0.005 |
| weight_decay | Deep learning network parameter. You are not advised to change the value. | 1e-5 |
| keyword_embedding_size | Deep learning network parameter. You are not advised to change the value. | 32 |
| char_embedding_size | Deep learning network parameter. You are not advised to change the value. | 32 |
| node_auxiliary_size | Deep learning network parameter. You are not advised to change the value. | 2 |
| first_hidden_size | Deep learning network parameter. You are not advised to change the value. | 32 |
| second_hidden_size | Deep learning network parameter. You are not advised to change the value. | 32 |
| drop_rate | Deep learning network parameter. You are not advised to change the value. | 0.1 |
| model_save_path_q | Path for saving the cost estimation model. Delete this file when switching the database platform for retraining. | - |
| model_save_path_q_mv | Path for saving the cost estimation model. Delete this file when switching the database platform for retraining. | - |
| training_data_output_path | Path for storing the output data during model training | - |
| batch_size_rl | Batch size for recommendation model training | 2 |
| learning_rate_rl | Learning rate during recommendation model training | 0.01 |
| epsilon | Reinforcement learning parameter. You are not advised to change the value. | 0.9 |
| gamma | Reinforcement learning parameter. You are not advised to change the value. | 0.9 |
| target_replace_iter | Reinforcement learning parameter. You are not advised to change the value. | 3 |
| memory_capacity | Reinforcement learning parameter. You are not advised to change the value. | 10 |
| drop_rate_rl | Reinforcement learning parameter. You are not advised to change the value. | 0.2 |
| spark_master | Spark execution mode. The value can be yarn or local. | yarn |
| app_name | Spark task name | - |
| spark_num_executors | Spark parameter --num-executors | 30 |
| spark_executor_memory | Spark parameter --executor-memory | 32g |
| spark_driver_memory | Spark parameter --driver-memory | 48g |
| executor_cores | Spark parameter --executor-cores | 18 |
| session_timeout | Timeout period of a spark-sql session, in seconds | 5000 |
| cnt_limit | Maximum number of top candidate materialized views. You can increase the value to cover more SQL statements. | 40 |
| refer_threshold | A subtree whose reference count is less than refer_threshold is not used as a candidate view. You are not advised to change the value. | 2 |
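As the [spark] configuration items show, most Spark options map directly to spark-submit flags. The sketch below illustrates that mapping; the function and dictionary names are illustrative, not OmniCache internals, and the exact command OmniCache builds may differ.

```python
# Mapping from [spark] section keys to their spark-submit flags,
# as documented in Table 1.
SPARK_FLAGS = {
    "spark_master": "--master",
    "app_name": "--name",
    "spark_num_executors": "--num-executors",
    "spark_executor_memory": "--executor-memory",
    "spark_driver_memory": "--driver-memory",
    "executor_cores": "--executor-cores",
}

def spark_submit_args(spark_section):
    """Turn a dict of [spark] options into a spark-submit argument list."""
    args = []
    for key, flag in SPARK_FLAGS.items():
        if key in spark_section:
            args += [flag, str(spark_section[key])]
    return args
```

For instance, the example [spark] section above would produce --master yarn --name omnicache --num-executors 30 --executor-memory 32g --driver-memory 48g --executor-cores 18.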