Spark Configuration File
Before running Spark, modify the BoostKit-omnimv_1.1.0/config/omnimv_config_spark.cfg file. The following is an example configuration file:
```
[schema]
# Schema SQL path
schema_path = hdfs://server1:9000/omnimv/schema.sql
table_bytes_path = hdfs://server1:9000/omnimv/table_bytes

[spark]
# Database name
database = tpcds_bin_partitioned_decimal_orc_10
# Time segment for executing the query in Spark on Yarn
q_log_start_time = 2023-04-03 11:21
q_log_end_time = 2023-04-03 13:55
# Time segment for executing the views created by the OmniMV plugin in Spark on Yarn
mv_log_start_time = 2023-03-31 16:05
mv_log_end_time = 2023-03-31 16:19
# Path of the log parser JAR package
logparser_jar_path = /opt/spark/boostkit-omnimv-logparser-spark-3.1.1-1.1.0.jar
# Path of the OmniMV plugin
cache_plugin_jar_path = /opt/spark/boostkit-omnimv-spark-3.1.1-1.1.0.jar
# Spark History output path configured in the spark-defaults.conf file
hdfs_input_path = hdfs://server1:9000/spark2-history
# Path for storing all logs after log parsing. Each log is parsed into the JSON format.
hdfs_output_path = hdfs://server1:9000/omnimv/spark2-history-json
# Path for storing SQL statements. The following path is an example of TPC-DS SQL.
sqls_path = hdfs://server1:9000/omnimv/tpcds
# Threshold of the table size, in bytes. If the table size exceeds the threshold, the
# table is considered a large table. The default threshold is 1 GB, that is, 1000000000.
large_table_threshold = 5000000
# Threshold for the number of candidate views. Only the top N candidate views are selected.
mv_limit = 10

[result]
# Path for storing candidate views. Subdirectories are generated in this directory:
# mv_sql (all candidate views), top_mv (top N candidate views), and mv_recommend
# (the recommended view).
mv_output_path = hdfs://server1:9000/omnimv/mv

[train]
# Path of the cost evaluation model
spark_model_path = hdfs://server1:9000/omnimv/training/spark_model.pkl
epochs = 200
estimators = 50
max_depth = 5
learning_rate = 0.05

[yarn]
# Spark execution mode. The value can be yarn or local.
spark_master = yarn
# Parameter specified by the --name Spark parameter
app_name = omnimv
# Parameter specified by the --num-executors Spark parameter
spark_num_executors = 30
# Parameter specified by the --executor-memory Spark parameter
spark_executor_memory = 32g
# Parameter specified by the --driver-memory Spark parameter
spark_driver_memory = 48g
# Parameter specified by the --executor-cores Spark parameter
executor_cores = 18
# Timeout duration of the Spark SQL session, in seconds. The default value is 5000.
session_timeout = 5000
```
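The file above uses the standard INI layout (sections in brackets, `key = value` pairs, `#` comments). As a minimal sketch, not part of OmniMV itself, it can be read with Python's `configparser`; the section and key names below follow the example file:

```python
import configparser

# Inline sample with a few of the keys from the example file; in practice you
# would call cfg.read("BoostKit-omnimv_1.1.0/config/omnimv_config_spark.cfg").
SAMPLE = """
[spark]
database = tpcds_bin_partitioned_decimal_orc_10
large_table_threshold = 5000000
mv_limit = 10

[yarn]
spark_master = yarn
spark_num_executors = 30
"""

cfg = configparser.ConfigParser()
cfg.read_string(SAMPLE)

# Values are stored as strings; numeric settings must be converted explicitly.
threshold = cfg.getint("spark", "large_table_threshold")
print(cfg["spark"]["database"])  # tpcds_bin_partitioned_decimal_orc_10
print(threshold)                 # 5000000
```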
In these paths, hdfs://server1:9000 indicates that data is cached in HDFS: server1 is the host name of the current server, and 9000 is the HDFS port number. Adjust these values for your environment, or use the simplified form. Example:
- Change sqls_path=hdfs://server1:9000/omnimv/tpcds to sqls_path=/omnimv/tpcds.
- In the configuration file, all paths other than those of the two JAR packages are HDFS paths by default.
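The simplified form works because dropping the `hdfs://server1:9000` prefix leaves only the path component of the URI, which HDFS resolves against the cluster's default file system (typically `fs.defaultFS` in core-site.xml). A small sketch of the split, using Python's standard URI parser:

```python
from urllib.parse import urlparse

full = "hdfs://server1:9000/omnimv/tpcds"
parsed = urlparse(full)

print(parsed.scheme)  # hdfs
print(parsed.netloc)  # server1:9000  -- host and port, the part you may omit
print(parsed.path)    # /omnimv/tpcds -- the simplified form of the setting
```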
| Parameter | Description | Default Value |
|---|---|---|
| schema_path | Path for storing the table structure | - |
| table_bytes_path | Path for storing table sizes | - |
| database | Name of the database used for training | - |
| q_log_start_time | Start time of the original SQL query | - |
| q_log_end_time | End time of the original SQL query | - |
| mv_log_start_time | Start time of the candidate materialized view | - |
| mv_log_end_time | End time of the candidate materialized view | - |
| logparser_jar_path | Path of the log parser JAR package | - |
| cache_plugin_jar_path | Path of the OmniMV plugin JAR package | - |
| hdfs_input_path | Path for storing historical Spark logs | - |
| hdfs_output_path | Path for storing logs parsed by OmniMV | - |
| sqls_path | Path for storing SQL statements | - |
| large_table_threshold | Threshold of the table size, in bytes. If the table size exceeds the threshold, the table is considered a large table. The default threshold is 1 GB, that is, 1000000000. | 1000000000 |
| mv_limit | Threshold for the number of candidate views. Only the top N candidate views are selected. | 10 |
| mv_output_path | Path for storing candidate views. Subdirectories are generated in this directory: mv_sql (all candidate views), top_mv (top N candidate views), and mv_recommend (the recommended view). | - |
| spark_model_path | Path of the cost evaluation model | - |
| epochs | Number of model training epochs | 200 |
| learning_rate | Learning rate of model training | 0.005 |
| estimators | Number of gradient boosting trees. This is a hyperparameter of the model. | 50 |
| max_depth | Depth of gradient boosting trees. This is a hyperparameter of the model. | 5 |
| spark_master | Spark execution mode | yarn |
| app_name | Spark task name | omnimv |
| spark_num_executors | Spark parameter --num-executors | 30 |
| spark_executor_memory | Spark parameter --executor-memory | 32g |
| spark_driver_memory | Spark parameter --driver-memory | 48g |
| executor_cores | Spark parameter --executor-cores | 18 |
| session_timeout | Timeout duration of a Spark SQL session, in seconds | 5000 |
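The [yarn] settings map one-to-one onto spark-submit command-line flags. As an illustrative sketch (not OmniMV code), this shows how the example values above would be assembled into a submit command:

```python
# Values from the [yarn] section of the example configuration file.
yarn_cfg = {
    "spark_master": "yarn",
    "app_name": "omnimv",
    "spark_num_executors": "30",
    "spark_executor_memory": "32g",
    "spark_driver_memory": "48g",
    "executor_cores": "18",
}

# Each key corresponds to one spark-submit flag, as listed in the table.
cmd = [
    "spark-submit",
    "--master", yarn_cfg["spark_master"],
    "--name", yarn_cfg["app_name"],
    "--num-executors", yarn_cfg["spark_num_executors"],
    "--executor-memory", yarn_cfg["spark_executor_memory"],
    "--driver-memory", yarn_cfg["spark_driver_memory"],
    "--executor-cores", yarn_cfg["executor_cores"],
]
print(" ".join(cmd))
```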