Spark Configuration File
Before running Spark, modify the BoostKit-omnimv_1.1.0/config/omnimv_config_spark.cfg file. The following is an example configuration file:
```
[schema]
# Schema SQL path
schema_path = hdfs://server1:9000/omnimv/schema.sql
table_bytes_path = hdfs://server1:9000/omnimv/table_bytes

[spark]
# Database name
database = tpcds_bin_partitioned_decimal_orc_10
# Time segment for executing the query in Spark on Yarn
q_log_start_time = 2023-04-03 11:21
q_log_end_time = 2023-04-03 13:55
# Time segment for executing the views created by the OmniMV plugin in Spark on Yarn
mv_log_start_time = 2023-03-31 16:05
mv_log_end_time = 2023-03-31 16:19
# Path of the log parser JAR package
logparser_jar_path = /opt/spark/boostkit-omnimv-logparser-spark-3.1.1-1.1.0.jar
# Path of the OmniMV plugin
cache_plugin_jar_path = /opt/spark/boostkit-omnimv-spark-3.1.1-1.1.0.jar
# Spark History output path configured in the spark-defaults.conf file
hdfs_input_path = hdfs://server1:9000/spark2-history
# Path for storing all logs after log parsing. Each log is parsed into the JSON format.
hdfs_output_path = hdfs://server1:9000/omnimv/spark2-history-json
# Path for storing SQL statements. The following path is an example of TPC-DS SQL.
sqls_path = hdfs://server1:9000/omnimv/tpcds
# Threshold of the table size, in bytes. If the table size exceeds the threshold, the
# table is considered a large table. The default threshold is 1 GB, that is, 1000000000.
large_table_threshold = 5000000
# Threshold for the number of candidate views. Only the top N candidate views are selected.
mv_limit = 10

[result]
# Path for storing candidate views. Subdirectories are generated in this directory:
# mv_sql (all candidate views), top_mv (top N candidate views), and mv_recommend
# (the recommended view).
mv_output_path = hdfs://server1:9000/omnimv/mv

[train]
# Path of the cost evaluation model
spark_model_path = hdfs://server1:9000/omnimv/training/spark_model.pkl
epochs = 200
estimators = 50
max_depth = 5
learning_rate = 0.05

[yarn]
# Spark execution mode. The value can be yarn or local.
spark_master = yarn
# Parameter specified by the --name Spark parameter
app_name = omnimv
# Parameter specified by the --num-executors Spark parameter
spark_num_executors = 30
# Parameter specified by the --executor-memory Spark parameter
spark_executor_memory = 32g
# Parameter specified by the --driver-memory Spark parameter
spark_driver_memory = 48g
# Parameter specified by the --executor-cores Spark parameter
executor_cores = 18
# Timeout duration of the Spark SQL session, in seconds. The default value is 5000.
session_timeout = 5000
```
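The file above uses the standard INI layout (sections in brackets, `key = value` pairs, `#` comments). As a minimal sketch, not part of OmniMV itself, it can be read with Python's `configparser`; the section and key names below follow the example file:

```python
import configparser

# Inline sample with a few of the keys from the example file; in practice you
# would call cfg.read("BoostKit-omnimv_1.1.0/config/omnimv_config_spark.cfg").
SAMPLE = """
[spark]
database = tpcds_bin_partitioned_decimal_orc_10
large_table_threshold = 5000000
mv_limit = 10

[yarn]
spark_master = yarn
spark_num_executors = 30
"""

cfg = configparser.ConfigParser()
cfg.read_string(SAMPLE)

# Values are stored as strings; numeric settings must be converted explicitly.
threshold = cfg.getint("spark", "large_table_threshold")
print(cfg["spark"]["database"])  # tpcds_bin_partitioned_decimal_orc_10
print(threshold)                 # 5000000
```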
In these paths, hdfs://server1:9000 indicates that data is cached in HDFS: server1 is the host name of the current server, and 9000 is the HDFS port number. Adjust these values for your environment, or use the simplified form. Example:
- Change sqls_path=hdfs://server1:9000/omnimv/tpcds to sqls_path=/omnimv/tpcds.
- In the configuration file, all paths other than those of the two JAR packages are HDFS paths by default.
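The simplified form works because dropping the `hdfs://server1:9000` prefix leaves only the path component of the URI, which HDFS resolves against the cluster's default file system (typically `fs.defaultFS` in core-site.xml). A small sketch of the split, using Python's standard URI parser:

```python
from urllib.parse import urlparse

full = "hdfs://server1:9000/omnimv/tpcds"
parsed = urlparse(full)

print(parsed.scheme)  # hdfs
print(parsed.netloc)  # server1:9000  -- host and port, the part you may omit
print(parsed.path)    # /omnimv/tpcds -- the simplified form of the setting
```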
| Parameter | Description | Default Value |
|---|---|---|
| schema_path | Path for storing the table structure | - |
| table_bytes_path | Path for storing table sizes | - |
| database | Name of the database used for training | - |
| q_log_start_time | Start time of the original SQL query | - |
| q_log_end_time | End time of the original SQL query | - |
| mv_log_start_time | Start time of the candidate materialized view | - |
| mv_log_end_time | End time of the candidate materialized view | - |
| logparser_jar_path | Path of the log parser JAR package | - |
| cache_plugin_jar_path | Path of the OmniMV plugin JAR package | - |
| hdfs_input_path | Path for storing historical Spark logs | - |
| hdfs_output_path | Path for storing logs parsed by OmniMV | - |
| sqls_path | Path for storing SQL statements | - |
| large_table_threshold | Threshold of the table size, in bytes. If the table size exceeds the threshold, the table is considered a large table. The default threshold is 1 GB, that is, 1000000000. | 1000000000 |
| mv_limit | Threshold for the number of candidate views. Only the top N candidate views are selected. | 10 |
| mv_output_path | Path for storing candidate views. Subdirectories are generated in this directory: mv_sql (all candidate views), top_mv (top N candidate views), and mv_recommend (the recommended view). | - |
| spark_model_path | Path of the cost evaluation model | - |
| epochs | Number of model training epochs | 200 |
| learning_rate | Learning rate of model training | 0.005 |
| estimators | Number of gradient boosting trees. This is a hyperparameter of the model. | 50 |
| max_depth | Depth of gradient boosting trees. This is a hyperparameter of the model. | 5 |
| spark_master | Spark execution mode | yarn |
| app_name | Spark task name | omnimv |
| spark_num_executors | Spark parameter --num-executors | 30 |
| spark_executor_memory | Spark parameter --executor-memory | 32g |
| spark_driver_memory | Spark parameter --driver-memory | 48g |
| executor_cores | Spark parameter --executor-cores | 18 |
| session_timeout | Timeout duration of a Spark SQL session, in seconds | 5000 |
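The [yarn] settings map one-to-one onto spark-submit command-line flags. As an illustrative sketch (not OmniMV code), this shows how the example values above would be assembled into a submit command:

```python
# Values from the [yarn] section of the example configuration file.
yarn_cfg = {
    "spark_master": "yarn",
    "app_name": "omnimv",
    "spark_num_executors": "30",
    "spark_executor_memory": "32g",
    "spark_driver_memory": "48g",
    "executor_cores": "18",
}

# Each key corresponds to one spark-submit flag, as listed in the table.
cmd = [
    "spark-submit",
    "--master", yarn_cfg["spark_master"],
    "--name", yarn_cfg["app_name"],
    "--num-executors", yarn_cfg["spark_num_executors"],
    "--executor-memory", yarn_cfg["spark_executor_memory"],
    "--driver-memory", yarn_cfg["spark_driver_memory"],
    "--executor-cores", yarn_cfg["executor_cores"],
]
print(" ".join(cmd))
```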