
Using OmniAdvisor

Install OmniAdvisor in a directory of your choice on the management node, for example /opt/OmniAdvisor; this directory is referred to as $OMNIADVISOR_HOME below. You can use OmniAdvisor either through the interactive OmniAdvisor CLI or through non-interactive commands. The OmniAdvisor CLI is recommended.

Using OmniAdvisor in the CLI

Using OmniAdvisor in the CLI involves the following steps: configure common_config.cfg, call the OmniAdvisor entry, initialize the environment configuration, parse historical task information, and use AI algorithms to sample and recommend parameters.

  1. Modify the common_config.cfg file.
    1. Open the common_config.cfg file on the management node.
      vi $OMNIADVISOR_HOME/BoostKit-omniadvisor_1.1.0/config/common_config.cfg
      
    2. Refer to the following configuration and replace example parameter values with actual ones.
      [workload]
      # Name of the tested database.
      workload_name = tpcds_bin_partitioned_decimal_orc_2
      # Path for storing the decompressed files of the log parsing module. The JAR package of the decompressed log parsing module must exist in this path.
      log_analyzer_path = /opt/OmniAdvisor/boostkit-omniadvisor-log-analyzer-1.1.0-aarch64
      # Unique ID of the task. You can use either the hash of the task name (application_name) or the hash of the task itself (that is, the hash of the query statement or JAR package) to look up the optimal parameters of the task in the database.
      # options: [application_name, job_hash]
      identification_type = job_hash
      # Maximum number of parameter sampling rounds.
      sampling_epochs = 20
      # How to handle a sampling task when it times out. The value can be kill or warn.
      # options: [kill, warn]
      timeout_strategy = kill
      # Improvement ratio threshold for parameter recommendation. The optimal parameters are updated when (historical optimal value - current optimal value) / current optimal value is greater than boosting_ratio.
      boosting_ratio = 0.03
      # Timeout duration of a subprocess, in seconds.
      proc_timeout_seconds = 3600
      
      [database]
      # MySQL database connection information, such as the user name and port.
      # db_name: MySQL database name. If the database does not exist, it is created automatically.
      # db_host: host of the MySQL database, generally localhost or the IP address of the MySQL server.
      # db_port: port used to connect to the MySQL database, 3306 by default.
      db_name = test
      db_host = localhost
      db_port = 3306
      
      [spark]
      # Start time of Spark run logs. You can view the date on the Spark history server (port 18080 by default).
      log_start_time = 2024-07-24 00:00:00
      # End time of Spark run logs.
      log_end_time = 2024-07-24 23:59:59
      # Default Spark parameters. Generally, the default parameters are not involved in parameter sampling.
      spark_default_config = --conf spark.sql.orc.impl=native --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.locality.wait=0 --conf spark.sql.broadcastTimeout=300
      
      [hive]
      # Start time of Tez run logs. You can view the date on the Hadoop UI (port 8088 by default).
      log_start_time = 2024-01-18 12:07:39
      # End time of Tez run logs.
      log_end_time = 2024-01-18 12:11:54
      # Default Hive parameters. Generally, the default parameters are not involved in parameter sampling.
      hive_default_config = --hiveconf hive.cbo.enable=true --hiveconf tez.am.container.reuse.enabled=true --hiveconf hive.merge.tezfiles=true
      
      [loganalyzer]
      # Number of concurrent log parsing processes, that is, number of concurrent analysis tasks.
      log.analyzer.thread.count = 3
      
      [kerberos]
      # User used for Kerberos authentication in secure mode.
      # kerberos.principal = primary/instance@REALM
      # Keytab file path used for Kerberos authentication in secure mode.
      # kerberos.keytab.file = /directory/kerberos.keytab
      
      [datasource]
      # Driver of the database used to save the analysis result after log analysis.
      datasource.db.driver = com.mysql.cj.jdbc.Driver
      
      [sparkfetcher]
      # Spark Fetcher mode, which can be log or rest.
      spark.eventlogs.mode = rest
      # URL of the Spark history server, which is used only in rest mode.
      spark.rest.url = http://server1:18080
      # Timeout duration of a Spark Fetcher analysis task, in seconds.
      spark.timeout.seconds = 30
      # Directory for storing Spark logs, which is used only in log mode.
      # spark.log.directory = hdfs://server1:9000/spark2-history
      # Maximum size of a Spark analysis log file, in MB. If the size of a log file exceeds the maximum size, the log file will be ignored. This parameter is used only in log mode.
      # spark.log.maxsize.mb = 500
      
      [tezfetcher]
      # URL of the timeline server.
      tez.timeline.url = http://server1:8188
      # Timeout duration of accessing the timeline server, in milliseconds.
      tez.timeline.timeout.ms = 6000
      
    3. Press Esc, type :wq!, and press Enter to save the file and exit.
  2. Call the OmniAdvisor entry.
    1. Run the OmniAdvisor entry on the management node to start the OmniAdvisor CLI.
      python $OMNIADVISOR_HOME/BoostKit-omniadvisor_1.1.0/main.pyc
    2. Input the MySQL user password as prompted and press Enter.

    3. Select an engine to be tuned.

      Input spark or hive, or press Tab to select an engine.

    4. Select a command to proceed.

    5. Execute the command to perform the target operation.
      Table 1 Commands and operations

      Command              Operation
      init_environment     Initializes the environment configuration, including the database connection, database creation, and data table initialization.
      fetch_history_data   Parses historical task information from the Spark history server or the Hive timeline server.
      parameter_sampling   Samples the parameters of all historical tasks using AI algorithms.
      parameter_recommend  Recommends parameters.

  3. Initialize the environment configuration.

    Input init_environment or press Tab to select init_environment, and then press Enter to initialize the environment.

  4. Parse historical task information.

    Input fetch_history_data or press Tab to select fetch_history_data, and then press Enter to call the log parsing module to parse historical task information.

  5. Sample parameters using AI algorithms.
    1. Input parameter_sampling or press Tab to select parameter_sampling, and then input the number of sampling rounds as prompted.

    2. Choose whether to sample all tasks in the database. By default, all tasks in the database are sampled and tuned.

    3. If you select no, the CLI lists all the tasks available for tuning. Copy the identification values of the tasks that you want to tune, separate them with commas (,), and press Enter to sample the parameters of those tasks.

  6. Perform parameter recommendation.

    Input parameter_recommend or press Tab to select parameter_recommend, and then input the SQL statement or application submission command. OmniAdvisor recommends parameters and submits them to the engine for execution.
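The boosting_ratio rule described in common_config.cfg can be illustrated with a short Python sketch. The function and variable names below are illustrative only, not part of OmniAdvisor, and the sketch assumes the "optimal value" is a run-time metric where smaller is better:

```python
# Illustration of the boosting_ratio rule from common_config.cfg:
# the optimal parameters are updated only when the relative improvement,
# (historical optimal value - current optimal value) / current optimal value,
# exceeds the configured threshold.

def should_update(historical_optimal: float, current_optimal: float,
                  boosting_ratio: float = 0.03) -> bool:
    """Return True if the improvement ratio exceeds boosting_ratio."""
    improvement = (historical_optimal - current_optimal) / current_optimal
    return improvement > boosting_ratio

# With boosting_ratio = 0.03, a current optimum of 97 s against a
# historical optimum of 100 s is a ~3.09% improvement (update);
# 98 s is only ~2.04% (no update).
print(should_update(100.0, 97.0))  # True
print(should_update(100.0, 98.0))  # False
```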

Using OmniAdvisor Through Non-interactive Commands

On the management node, you can use OmniAdvisor options to quickly perform target operations. See Table 2.

Table 2 OmniAdvisor options and parameters

  -e, --engine
      Big data engine to be tuned: spark or hive.
      Example: python main.pyc -e spark

  -i, --instruction
      Operation to be performed: init_environment, fetch_history_data, parameter_sampling, or parameter_recommend.
      Example: python main.pyc -e spark -i init_environment

  -s, --sampling_id
      Sampling identifiers, separated by commas (,).
      Example: python main.pyc -e spark -i parameter_sampling -s "xx1"

  -n, --sampling_count
      Number of parameter sampling rounds.
      Example: python main.pyc -e spark -i parameter_sampling -n 10

  -c, --cmd
      Spark or Hive submission command of the specified parameter recommendation task.
      Example: python main.pyc -e spark -i parameter_recommend -c "spark-sql xxx"

  -v, --version
      Displays the OmniAdvisor version.
      Example: python main.pyc -v

  --help
      Displays help information.
      Example: python main.pyc --help

  1. Initialize the environment.
    python $OMNIADVISOR_HOME/BoostKit-omniadvisor_1.1.0/main.pyc -e spark -i init_environment
  2. Parse historical task information.
    python $OMNIADVISOR_HOME/BoostKit-omniadvisor_1.1.0/main.pyc -e spark -i fetch_history_data
  3. Sample parameters.
    Example: Perform three rounds of parameter sampling on the tasks whose identification values are xx1 and xx2.
    python $OMNIADVISOR_HOME/BoostKit-omniadvisor_1.1.0/main.pyc -e spark -i parameter_sampling -s "xx1,xx2" -n 3
  4. Perform parameter recommendation.
    python $OMNIADVISOR_HOME/BoostKit-omniadvisor_1.1.0/main.pyc -e spark -i parameter_recommend -c "spark-sql xxx"
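The four non-interactive steps above can be chained from a small wrapper script. The sketch below only assembles and prints the command lines (a dry run); the install path and the task identifiers "xx1,xx2" are the examples from this section, not fixed values. To actually execute the steps, replace the print with subprocess.run(cmd, check=True):

```python
import os
import shlex

# Build the four non-interactive OmniAdvisor commands shown above.
# OMNIADVISOR_HOME defaults to the example install path from this guide.
home = os.environ.get("OMNIADVISOR_HOME", "/opt/OmniAdvisor")
main = f"{home}/BoostKit-omniadvisor_1.1.0/main.pyc"

steps = [
    ["-i", "init_environment"],
    ["-i", "fetch_history_data"],
    ["-i", "parameter_sampling", "-s", "xx1,xx2", "-n", "3"],
    ["-i", "parameter_recommend", "-c", "spark-sql xxx"],
]

commands = [["python", main, "-e", "spark", *extra] for extra in steps]

for cmd in commands:
    # Dry run: print each command; swap in subprocess.run(cmd, check=True)
    # to execute the steps in order.
    print(shlex.join(cmd))
```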