OmniCache Configuration File
Before running OmniCache, modify the BoostKit-omnicache_1.0.0/config/omnicache_config.cfg file. The following is an example of the configuration file:
```ini
[schema]
# Schema path
schema_path = hdfs://server1:9000/omnicache_0919/schema.sql

[raw_file]
# Database name
database = tpcds_bin_partitioned_decimal_orc_8000
# Specifies the time segment for executing the query in Spark yarn mode.
q_log_start_time = 2022-09-16 20:28
q_log_end_time = 2022-09-17 08:50
# Specifies the time segment for executing the successfully rewritten query using the OmniCache plugin in Spark yarn mode.
q_mv_log_start_time = 2022-09-17 16:37
q_mv_log_end_time = 2022-09-17 19:20
# Path of the log parser JAR package
logparser_jar_path = /opt/omnicache/boostkit-omnicache-logparser-spark-3.1.1-1.0.0-aarch64.jar
# Path of the OmniCache plugin
cache_plugin_jar_path = /opt/omnicache/boostkit-omnicache-spark-3.1.1-1.0.0-aarch64.jar
# Spark history output path configured in the spark-defaults.conf file
hdfs_input_path = hdfs://server1:9000/spark2-history
# Path for storing all logs after log parsing. Each log is parsed into JSON format.
hdfs_output_path = hdfs://server1:9000/omnicache_0919/spark2-history-json
# Path for storing SQL statements. The following path is an example.
sqls_path = hdfs://server1:9000/omnicache_0919/tpcds

[csv]
# The following three files are intermediate result files of the recommend task, that is, the results of using the cost estimation model.
query_csv = hdfs://server1:9000/omnicache_0919/data/query_data.csv
mv_csv = hdfs://server1:9000/omnicache_0919/data/mv_data.csv
q_mv_csv = hdfs://server1:9000/omnicache_0919/data/query_mv_data.csv
# Path for storing candidate views. Subdirectories are generated in this directory: mv_sql (all candidate views), top_mv (top N candidate views), and mv_recommend (the recommended view).
mv_output_path = hdfs://server1:9000/omnicache_0919/mv
# Top N query-view mappings
q_mv_mapping_csv = hdfs://server1:9000/omnicache_0919/data/query-mv_mapping.csv
# Output result file of the recommend task, including the ID and cost of the recommended materialized view. You can find the original SQL statements of the materialized view in ${mv_output_path}/mv_sql based on the ID.
recommend_mv_csv = hdfs://server1:9000/omnicache_0919/data/recommend-mvdata.csv

[train]
# Ratio of training data. You are not advised to change the value.
train_rate = 0.8
# Training batch size. Adjust the value based on the value of N in top N. The batch size is an integer power of 2, N/batch_size should be about 10, and batch_size cannot exceed 256.
batch_size = 8
# Number of training epochs. Adjust the value based on the value of N in top N. The value usually ranges from 30 to 100. When N is 20, set epoch to 30; increase epoch as N grows.
epoch = 20
# Learning rate. You are not advised to change the value.
learning_rate = 0.005
# The following are deep learning network parameters. You are not advised to adjust them.
weight_decay = 1e-5
keyword_embedding_size = 32
char_embedding_size = 32
node_auxiliary_size = 2
first_hidden_size = 32
second_hidden_size = 32
drop_rate = 0.1
# Paths for saving the cost estimation models. Delete these files when switching the database platform for retraining.
model_save_path_q = hdfs://server1:9000/omnicache_0919/training/ckpt_q.pt
model_save_path_q_mv = hdfs://server1:9000/omnicache_0919/training/ckpt_q_mv.pt
# Path for storing the output data during model training
training_data_output_path = hdfs://server1:9000/omnicache_0919/training

[rl]
# Batch size for recommendation model training
batch_size_rl = 2
# Learning rate during recommendation model training
learning_rate_rl = 0.01
# The following are reinforcement learning parameters. You are not advised to adjust them.
epsilon = 0.9
gamma = 0.9
target_replace_iter = 3
memory_capacity = 10
drop_rate_rl = 0.2

[spark]
# Spark execution mode. The value can be yarn or local.
spark_master = yarn
# Value of the --name Spark parameter
app_name = omnicache
# Value of the --num-executors Spark parameter
spark_num_executors = 30
# Value of the --executor-memory Spark parameter
spark_executor_memory = 32g
# Value of the --driver-memory Spark parameter
spark_driver_memory = 48g
# Value of the --executor-cores Spark parameter
executor_cores = 18
# Timeout period for the spark-sql session, in seconds. The default value is 5000s.
session_timeout = 5000

[mv_limit]
# Maximum number of top candidate materialized views
cnt_limit = 40
space_limit = 1024
# A subtree whose reference count is less than refer_threshold is not used as a candidate view. The default value is 2.
refer_threshold = 2

[mv_selection]
query_ratio = 0.6
space_ratio = 0.35
filter_ratio = 0.05
```
`hdfs://server1:9000` indicates that data is cached in HDFS. `server1` is the host name of the current server, and `9000` is the HDFS port number. You can adjust these values based on your environment, or use a simplified form. For example:
Change `sqls_path = hdfs://server1:9000/omnicache_0919/tpcds` to `sqls_path = /omnicache_0919/tpcds`.
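As a quick sanity check on this URI structure, Python's standard `urllib.parse` splits such an HDFS URI into exactly the host, port, and path components described above (the URI value is the example from the configuration file):

```python
from urllib.parse import urlparse

# Split an HDFS URI of the form used in the configuration file into its parts.
uri = "hdfs://server1:9000/omnicache_0919/tpcds"
parsed = urlparse(uri)

print(parsed.hostname)  # server1 (host name of the current server)
print(parsed.port)      # 9000 (HDFS port number)
print(parsed.path)      # /omnicache_0919/tpcds (the simplified form of the path)
```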
In the configuration file, all paths are HDFS paths by default, except the paths of the two JAR packages (`logparser_jar_path` and `cache_plugin_jar_path`), which are local file system paths.
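The file follows standard INI syntax (sections in brackets, `key = value` pairs, `#` comments), so it can be read with Python's standard `configparser`. The sketch below uses an inline excerpt of the example file above for illustration; whether OmniCache itself parses the file this way is not documented here:

```python
import configparser

# Minimal sketch: parse an excerpt of omnicache_config.cfg.
# In practice you would call config.read("config/omnicache_config.cfg") instead.
config = configparser.ConfigParser()
config.read_string("""
[raw_file]
# Database name
database = tpcds_bin_partitioned_decimal_orc_8000
sqls_path = hdfs://server1:9000/omnicache_0919/tpcds
[train]
batch_size = 8
train_rate = 0.8
""")

database = config.get("raw_file", "database")
batch_size = config.getint("train", "batch_size")   # typed accessors validate values
train_rate = config.getfloat("train", "train_rate")
```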
| Item | Description | Default Value |
|---|---|---|
| schema_path | Path for storing the table structure | - |
| database | Name of the database used for training | - |
| q_log_start_time | Start time of the original SQL queries | - |
| q_log_end_time | End time of the original SQL queries | - |
| q_mv_log_start_time | Start time of the SQL queries rewritten using materialized views | - |
| q_mv_log_end_time | End time of the SQL queries rewritten using materialized views | - |
| logparser_jar_path | Path of the log parser JAR package | - |
| cache_plugin_jar_path | Path of the OmniCache plugin JAR package | - |
| hdfs_input_path | Path for storing historical Spark logs | - |
| hdfs_output_path | Path for storing logs parsed by OmniCache | - |
| sqls_path | Path for storing the SQL set to be trained | - |
| query_csv | Path for storing query data | - |
| mv_csv | Path for storing view data | - |
| q_mv_csv | Path for storing query-view mappings | - |
| mv_output_path | Output path of candidate views | - |
| q_mv_mapping_csv | Path for storing the top N query-view mappings | - |
| recommend_mv_csv | Output path of the recommended view | - |
| train_rate | Ratio of training data. You are not advised to change the value. | 0.8 |
| batch_size | Training batch size. Adjust the value based on the value of N in top N: the batch size is an integer power of 2, N/batch_size should be about 10, and batch_size cannot exceed 256. | 8 |
| epoch | Number of training epochs. Adjust the value based on the value of N in top N. The value usually ranges from 30 to 100. When N is 20, set epoch to 30; increase epoch for larger N. | 20 |
| learning_rate | Learning rate. You are not advised to change the value. | 0.005 |
| weight_decay | Deep learning network parameter. You are not advised to change the value. | 1e-5 |
| keyword_embedding_size | Deep learning network parameter. You are not advised to change the value. | 32 |
| char_embedding_size | Deep learning network parameter. You are not advised to change the value. | 32 |
| node_auxiliary_size | Deep learning network parameter. You are not advised to change the value. | 2 |
| first_hidden_size | Deep learning network parameter. You are not advised to change the value. | 32 |
| second_hidden_size | Deep learning network parameter. You are not advised to change the value. | 32 |
| drop_rate | Deep learning network parameter. You are not advised to change the value. | 0.1 |
| model_save_path_q | Path for saving the cost estimation model. Delete this file when switching the database platform for retraining. | - |
| model_save_path_q_mv | Path for saving the cost estimation model. Delete this file when switching the database platform for retraining. | - |
| training_data_output_path | Path for storing the output data during model training | - |
| batch_size_rl | Batch size for recommendation model training | 2 |
| learning_rate_rl | Learning rate for recommendation model training | 0.01 |
| epsilon | Reinforcement learning parameter. You are not advised to change the value. | 0.9 |
| gamma | Reinforcement learning parameter. You are not advised to change the value. | 0.9 |
| target_replace_iter | Reinforcement learning parameter. You are not advised to change the value. | 3 |
| memory_capacity | Reinforcement learning parameter. You are not advised to change the value. | 10 |
| drop_rate_rl | Reinforcement learning parameter. You are not advised to change the value. | 0.2 |
| spark_master | Spark execution mode. The value can be yarn or local. | yarn |
| app_name | Spark task name (the --name Spark parameter) | - |
| spark_num_executors | Spark --num-executors parameter | 30 |
| spark_executor_memory | Spark --executor-memory parameter | 32g |
| spark_driver_memory | Spark --driver-memory parameter | 48g |
| executor_cores | Spark --executor-cores parameter | 18 |
| session_timeout | Timeout period of a spark-sql session, in seconds | 5000 |
| cnt_limit | Maximum number of candidate views. You can increase the value to cover more SQL statements. | 40 |
| refer_threshold | A subtree whose reference count is less than refer_threshold is not used as a candidate view. You are not advised to change the value. | 2 |
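The batch_size tuning rule above (an integer power of 2, N/batch_size about 10, capped at 256) can be sketched as a small helper. The function name is hypothetical and only encodes the guidance stated in the table, not anything OmniCache ships:

```python
def suggest_batch_size(n: int) -> int:
    """Suggest a [train] batch_size for top-N candidate views, per the table's
    guidance: pick the power of two closest to n/10, never exceeding 256."""
    target = max(n / 10, 1)
    size = 1
    # Keep doubling while the next power of two is closer to the target
    # and still within the documented cap of 256.
    while size * 2 <= 256 and abs(size * 2 - target) < abs(size - target):
        size *= 2
    return size

print(suggest_batch_size(80))    # -> 8 (matches the example default: N=80, 80/8 = 10)
print(suggest_batch_size(2000))  # -> 256 (cap applies for very large N)
```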