Spark Configuration File

Before running Spark, you need to modify the BoostKit-omnimv_1.1.0/config/omnimv_config_spark.cfg file. An example of the configuration file is as follows:

[schema]
# Schema SQL path
schema_path = hdfs://server1:9000/omnimv/schema.sql
table_bytes_path = hdfs://server1:9000/omnimv/table_bytes
[spark]
# Database name
database = tpcds_bin_partitioned_decimal_orc_10
# Time segment for executing the query in Spark on Yarn
q_log_start_time = 2023-04-03 11:21
q_log_end_time = 2023-04-03 13:55
# Time segment for executing the views created by the OmniMV plugin in Spark on Yarn
mv_log_start_time = 2023-03-31 16:05
mv_log_end_time = 2023-03-31 16:19
# Path of the log parser JAR package
logparser_jar_path = /opt/spark/boostkit-omnimv-logparser-spark-3.1.1-1.1.0.jar
# Path of the OmniMV plugin
cache_plugin_jar_path = /opt/spark/boostkit-omnimv-spark-3.1.1-1.1.0.jar
# Spark History output path configured in the spark-defaults.conf file
hdfs_input_path = hdfs://server1:9000/spark2-history
# Path for storing all logs after log parsing. Each log is parsed into the JSON format.
hdfs_output_path = hdfs://server1:9000/omnimv/spark2-history-json
# Path for storing SQL statements. The following path is an example of TPC-DS SQL.
sqls_path = hdfs://server1:9000/omnimv/tpcds
# Threshold of the table size, in bytes. A table larger than the threshold is considered a large table. The default threshold is 1 GB, that is, 1000000000.
large_table_threshold = 5000000
# Threshold for the number of candidate views. Only the top N candidate views are selected.
mv_limit = 10
[result]
# Path for storing candidate views. Subdirectories are generated in the directory: mv_sql (storing all candidate views), top_mv (storing top N candidate views), and mv_recommend (storing the recommended view).
mv_output_path = hdfs://server1:9000/omnimv/mv
[train]
# Path of the cost evaluation model
spark_model_path = hdfs://server1:9000/omnimv/training/spark_model.pkl
epochs = 200
estimators = 50
max_depth = 5
learning_rate = 0.05
[yarn]
# Spark execution mode. The value can be yarn or local.
spark_master = yarn
# Parameter following the --name Spark parameter
app_name = omnimv
# Parameter specified by the --num-executors Spark parameter
spark_num_executors = 30
# Parameter specified by the --executor-memory Spark parameter
spark_executor_memory = 32g
# Parameter specified by the --driver-memory Spark parameter
spark_driver_memory = 48g
# Parameter specified by the --executor-cores Spark parameter
executor_cores = 18
# Timeout duration of the Spark SQL session, in seconds. The default value is 5000.
session_timeout = 5000

hdfs://server1:9000 indicates that data is stored in HDFS, where server1 is the host name of the current server and 9000 is the HDFS port number. You can adjust these values to match your environment, or use a simplified form. Example:

  • Change sqls_path=hdfs://server1:9000/omnimv/tpcds to sqls_path=/omnimv/tpcds.
  • In the configuration file, all paths except those of the two JAR packages are HDFS paths by default, so the hdfs://server1:9000 prefix can be omitted.
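As a quick sanity check before running OmniMV, the configuration file can be loaded with Python's standard configparser module. This is an illustrative sketch, not part of OmniMV itself; the inline string reproduces a fragment of the example file above (for a real file, use config.read with the file path instead).

```python
import configparser

# Fragment of the example omnimv_config_spark.cfg shown above.
CFG = """
[spark]
database = tpcds_bin_partitioned_decimal_orc_10
large_table_threshold = 5000000
mv_limit = 10
"""

config = configparser.ConfigParser()
config.read_string(CFG)  # for a real file: config.read("config/omnimv_config_spark.cfg")

large_table_threshold = config.getint("spark", "large_table_threshold")

def is_large_table(size_bytes: int) -> bool:
    # A table whose size exceeds the threshold is treated as a large table.
    return size_bytes > large_table_threshold

print(is_large_table(6_000_000))  # True: 6 MB exceeds the 5 MB example threshold
```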
Table 1 Spark configuration file

| Parameter | Description | Default Value |
|---|---|---|
| schema_path | Path for storing the table structure | - |
| table_bytes_path | Path for storing table sizes | - |
| database | Name of the database used for training | - |
| q_log_start_time | Start time of the original SQL query | - |
| q_log_end_time | End time of the original SQL query | - |
| mv_log_start_time | Start time of the candidate materialized view | - |
| mv_log_end_time | End time of the candidate materialized view | - |
| logparser_jar_path | Path of the log parser JAR package | - |
| cache_plugin_jar_path | Path of the OmniMV plugin JAR package | - |
| hdfs_input_path | Path for storing historical Spark logs | - |
| hdfs_output_path | Path for storing logs parsed by OmniMV | - |
| large_table_threshold | Table size threshold, in bytes. A table larger than the threshold is considered a large table. | 1000000000 (1 GB) |
| mv_limit | Maximum number of candidate views. Only the top N candidate views are selected. | 10 |
| mv_output_path | Path for storing candidate views. Subdirectories are generated in this directory: mv_sql (all candidate views), top_mv (top N candidate views), and mv_recommend (the recommended view). | - |
| spark_model_path | Path of the cost evaluation model | - |
| epochs | Number of model training epochs | 200 |
| learning_rate | Learning rate of model training | 0.005 |
| estimators | Number of gradient boosting trees (model hyperparameter) | 50 |
| max_depth | Depth of gradient boosting trees (model hyperparameter) | 5 |
| spark_master | Spark execution mode (yarn or local) | yarn |
| app_name | Spark application name (--name) | omnimv |
| spark_num_executors | Spark parameter --num-executors | 30 |
| spark_executor_memory | Spark parameter --executor-memory | 32g |
| spark_driver_memory | Spark parameter --driver-memory | 48g |
| executor_cores | Spark parameter --executor-cores | 18 |
| session_timeout | Timeout duration of a Spark SQL session, in seconds | 5000 |
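For illustration, the [yarn] parameters map onto standard spark-submit flags. The sketch below assembles such a command line from the configuration; it is an assumption about how the values could be used, not OmniMV's actual launcher.

```python
import configparser

# Reproduce the [yarn] section of the example configuration above.
CFG = """
[yarn]
spark_master = yarn
app_name = omnimv
spark_num_executors = 30
spark_executor_memory = 32g
spark_driver_memory = 48g
executor_cores = 18
"""

config = configparser.ConfigParser()
config.read_string(CFG)
yarn = config["yarn"]

# Map each configuration key to its corresponding spark-submit flag.
args = [
    "spark-submit",
    "--master", yarn["spark_master"],
    "--name", yarn["app_name"],
    "--num-executors", yarn["spark_num_executors"],
    "--executor-memory", yarn["spark_executor_memory"],
    "--driver-memory", yarn["spark_driver_memory"],
    "--executor-cores", yarn["executor_cores"],
]
print(" ".join(args))
```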