
Using OmniCache to Recommend Materialized Views

In this example, the TPC-DS data set is used as the test data.

  1. Generate test data.
    1. Use the hive-testbench tool to generate 8 TB of TPC-DS data. hive-testbench is an open-source Hive benchmarking tool developed by Hortonworks. You can download it from https://github.com/hortonworks/hive-testbench or run the following command:
      wget https://github.com/hortonworks/hive-testbench/archive/hdp3.zip --no-check-certificate
    2. Install Maven. Download the installation package and install Maven in a directory of your choice (for example, /opt/tools/installed/).
      wget https://archive.apache.org/dist/maven/maven-3/3.5.4/binaries/apache-maven-3.5.4-bin.tar.gz
      tar -zxf apache-maven-3.5.4-bin.tar.gz
      mv apache-maven-3.5.4 /opt/tools/installed/
    3. Modify the Maven environment variables.
      1. Open the /etc/profile file.
        vim /etc/profile
      2. Press i to enter the insert mode and add the following content to the end of the file:
        export MAVEN_HOME=/opt/tools/installed/apache-maven-3.5.4
        export PATH=$MAVEN_HOME/bin:$PATH
      3. Press Esc, type :wq!, and press Enter to save the file and exit.
      4. Make the environment variables take effect.
        source /etc/profile
    4. Check whether Maven is successfully installed.
      mvn -v
  2. Compile TPC-DS.
    1. Go to the tpcds-gen directory.
      cd tpcds-gen
    2. Open the pom file.
      vim pom.xml
    3. Press i to enter the insert mode, and change the Hadoop version in the <dependencies> tag to the version actually used in your environment. In this document the Hadoop version is 3.1.1, while the open-source default in the pom file is 2.2.0.

      Before compilation, ensure that your environment has network connectivity so that dependency packages can be downloaded. Alternatively, compile in an environment with network access and then copy the generated files to the non-networked environment.

    4. Press Esc, type :wq!, and press Enter to save the file and exit.
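If you prefer not to edit pom.xml interactively, the same change can be scripted. The sketch below demonstrates the substitution on a stand-in pom.xml in a temporary directory; the hadoop.version tag name and both version numbers are assumptions, so check them against your actual pom file before running the same sed on it.

```shell
# Demo of the Hadoop version change on a stand-in pom.xml (hypothetical tag name).
cd "$(mktemp -d)"
cat > pom.xml <<'EOF'
<project>
  <properties>
    <hadoop.version>2.2.0</hadoop.version>
  </properties>
</project>
EOF
# Replace the open-source default (2.2.0) with the cluster version (3.1.1 here).
sed -i 's|<hadoop.version>2.2.0</hadoop.version>|<hadoop.version>3.1.1</hadoop.version>|' pom.xml
grep 'hadoop.version' pom.xml   # confirm the change
```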
    5. Compile TPC-DS. If an error is reported due to certificate issues during compilation, rectify the fault by referring to An Error Is Reported During TPC-DS Compilation.
      cd /opt/hive-testbench-hdp3/
      sh tpcds-build.sh
  3. Modify the TPC-DS script.

    This script is written for HDP3. If you use the Apache version, modify tpcds-setup.sh as follows:

    1. Change the Beeline command.
      1. Open the /opt/hive-testbench-hdp3/tpcds-setup.sh script.
        vim /opt/hive-testbench-hdp3/tpcds-setup.sh
      2. Press i to enter the insert mode. Change the command based on your environment requirements. To facilitate the test, HIVE is set to "hive" in this example.

      3. Press Esc, type :wq!, and press Enter to save the file and exit.
    2. Comment out the hive.optimize.sort.dynamic.partition.threshold parameter.
      1. Open the /opt/hive-testbench-hdp3/settings/load-partitioned.sql file.
        vim /opt/hive-testbench-hdp3/settings/load-partitioned.sql
      2. Press i to enter the insert mode and add a number sign (#) at the beginning of the last line to comment it out.
      3. Press Esc, type :wq!, and press Enter to save the file and exit.
    3. Comment out the set hive.optimize.sort.dynamic.partition.threshold=0 statement.
      1. Open the /opt/hive-testbench-hdp3/settings/load-flat.sql file.
        vim /opt/hive-testbench-hdp3/settings/load-flat.sql
      2. Press i to enter the insert mode and add a number sign (#) at the beginning of the last line to comment it out.


      3. Press Esc, type :wq!, and press Enter to save the file and exit.
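Both edits in this step comment out the last line of a settings file, which can also be done non-interactively with sed. A minimal sketch on a stand-in file follows; the file content is invented for the demo, and in practice you would run the same sed on the two settings files above.

```shell
cd "$(mktemp -d)"
# Stand-in for load-flat.sql / load-partitioned.sql (contents are illustrative).
printf 'set hive.exec.dynamic.partition=true;\nset hive.optimize.sort.dynamic.partition.threshold=0;\n' > load-flat.sql
# '$' addresses the last line; prefixing it with '#' comments it out.
sed -i '$ s/^/#/' load-flat.sql
tail -n 1 load-flat.sql
```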
  4. Generate TPC-DS data.

    Generate 8 TB of test data. (The argument is the scale factor in GB; 8000 GB is generally sufficient for a Hive performance test.)

    sh tpcds-setup.sh 8000
  5. Modify the configuration file.

    Modify the config/omnicache_config.cfg file. Example:

    # Database name
    database = tpcds_bin_partitioned_decimal_orc_8000
    # Path to the log parser JAR package
    logparser_jar_path = /opt/omnicache/boostkit-omnicache-logparser-spark-3.1.1-1.0.0-aarch64.jar
    # Path to the OmniCache plugin
    cache_plugin_jar_path = /opt/omnicache/boostkit-omnicache-spark-3.1.1-1.0.0-aarch64.jar
    # Spark history output path configured in the spark-defaults.conf file
    hdfs_input_path = hdfs://server1:9000/spark2-history
    # Path for storing SQL statements. The following path is an example.
    sqls_path = hdfs://server1:9000/omnicache_0919/tpcds
    Table 1 OmniCache configuration items

    Configuration Item     | Description                                         | Reference Value
    -----------------------|-----------------------------------------------------|----------------
    database               | Name of the database used for the test              | -
    logparser_jar_path     | Path to the log parser JAR package                  | /opt/omnicache/boostkit-omnicache-logparser-spark-3.1.1-1.0.0-aarch64.jar
    cache_plugin_jar_path  | Path to the OmniCache JAR package                   | /opt/omnicache/boostkit-omnicache-spark-3.1.1-1.0.0-aarch64.jar
    hdfs_input_path        | Address for accessing Spark task run logs in HDFS   | hdfs://server1:9000/spark2-history
    sqls_path              | Address for accessing the test SQL statements in HDFS | hdfs://server1:9000/omnicache_0919/tpcds
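Assuming the configuration file uses the simple key = value lines shown above, a quick sanity check that all five keys are present can be scripted. The stand-in file below mirrors the example; in practice CFG would point at config/omnicache_config.cfg.

```shell
cd "$(mktemp -d)"
# Stand-in config mirroring the example above.
cat > omnicache_config.cfg <<'EOF'
database = tpcds_bin_partitioned_decimal_orc_8000
logparser_jar_path = /opt/omnicache/boostkit-omnicache-logparser-spark-3.1.1-1.0.0-aarch64.jar
cache_plugin_jar_path = /opt/omnicache/boostkit-omnicache-spark-3.1.1-1.0.0-aarch64.jar
hdfs_input_path = hdfs://server1:9000/spark2-history
sqls_path = hdfs://server1:9000/omnicache_0919/tpcds
EOF
CFG=omnicache_config.cfg   # in practice: config/omnicache_config.cfg
# Report any required key that is missing.
for key in database logparser_jar_path cache_plugin_jar_path hdfs_input_path sqls_path; do
  grep -q "^${key} *=" "${CFG}" || echo "missing: ${key}"
done
```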

  6. Obtain test SQL statements (for example, TPC-DS).
    1. Download the Spark source code and decompress it. Obtain the TPC-DS test SQL statements from spark-master/sql/core/src/test/resources/tpcds/.
    2. Upload the test SQL statements to HDFS (the path is the value of sqls_path in the configuration file).
      hdfs dfs -mkdir -p hdfs://server1:9000/omnicache_0919/tpcds
      hdfs dfs -put /Spark_source_code_directory/sql/core/src/test/resources/tpcds/* hdfs://server1:9000/omnicache_0919/tpcds/
  7. Initialize the database. This step obtains the table creation information of the target tables, which is needed to create the recommended views.

    If the message "init_database succeed!" is displayed, the execution is successful.

    python3 main.pyc init_database

  8. Run the test SQL statements. (The TPC-DS test SQL statements are used as an example.)

    If the message "run_original_sqls succeed!" is displayed, the execution is successful.

    python3 main.pyc run_original_sqls

  9. Parse run logs.
    1. Access the Spark history server at http://ip:18080 and note the start time and end time of the native SQL task.
    2. Modify the config/omnicache_config.cfg file.
      # Execution start time of the test SQL statements
      q_log_start_time = 2022-09-15 11:58
      # Execution end time of the test SQL statements
      q_log_end_time = 2022-09-15 17:15
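The two timestamps follow the minute-precision format shown in the example, assumed here to be the strftime pattern %Y-%m-%d %H:%M. A small helper to produce values in that format from the shell:

```shell
# Print the current time in the assumed "YYYY-MM-DD HH:MM" format, e.g. for
# pasting into q_log_start_time / q_log_end_time.
date '+%Y-%m-%d %H:%M'
```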
    3. Run the script for parsing the logs.

      If the message "parse_query_logs succeed!" is displayed, the execution is successful.

      python3 main.pyc parse_query_logs

  10. Generate candidate views.
    1. Optional: Modify the config/omnicache_config.cfg file.
      1. Open the file.
        vim config/omnicache_config.cfg
      2. Press i to enter the insert mode and modify the following content of the file:
        # Maximum number of top candidate materialized views. The number of recommended materialized views is less than or equal to the value of this parameter.
        cnt_limit = 40
      3. Press Esc, type :wq!, and press Enter to save the file and exit.
    2. Run the script.
      If the message "generate_candidate_view_spark succeed!" is displayed, the execution is successful.
      python3 main.pyc generate_views

  11. Create top N materialized views. (Manual confirmation is required to determine whether to create candidate views.)

    If the message "create_top_views succeed!" is displayed, the execution is successful.

    python3 main.pyc create_top_views

  12. Run the materialized view rewriting task.

    If the message "run_rewrite_sqls succeed!" is displayed, the execution is successful.

    python3 main.pyc run_rewrite_sqls

  13. Parse the rewriting task logs.
    1. Access the Spark history server at http://ip:18080 and note the start time and end time of the rewritten SQL task.
    2. Modify the config/omnicache_config.cfg file.
      1. Open the file.
        vim config/omnicache_config.cfg
      2. Press i to enter the insert mode and modify the following content of the file:
        # Start time of the rewriting task
        q_mv_log_start_time = 2022-09-15 20:18
        # End time of the rewriting task
        q_mv_log_end_time = 2022-09-15 21:17
      3. Press Esc, type :wq!, and press Enter to save the file and exit.
    3. Run the script.

      If the message "parse_query_mv_logs succeed!" is displayed, the execution is successful.

      python3 main.pyc parse_query_mv_logs

  14. Train the cost estimation model.

    If the message "train_cost_estimation_model succeed!" is displayed, the execution is successful.

    python3 main.pyc cost_estimation

  15. Train the recommendation model.

    If the message "train_recommend_model succeed!" is displayed, the execution is successful.

    python3 main.pyc recommend

  16. Obtain recommended materialized views.
    If the message "get_recommend_mvs succeed!" is displayed, the execution is successful.
    python3 main.pyc get_recommend_mvs
  17. Create recommended materialized views. (Manual confirmation is required to determine whether to create recommended views.)
    If the message "create_recommend_views succeed!" is displayed, the execution is successful.
    python3 main.pyc create_recommend_views
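Steps 7 to 17 invoke main.pyc with a fixed sequence of subcommands. The wrapper below is a hypothetical convenience, not part of OmniCache: it runs the subcommands in order and stops at the first failure. It cannot replace the manual parts of the procedure (updating the log time ranges before the two parse steps, and confirming view creation), so treat it as a sketch.

```shell
# Hypothetical wrapper for the main.pyc subcommand sequence in steps 7-17.
run_pipeline() {
  # $1: command prefix, e.g. "python3 main.pyc"; pass "echo python3 main.pyc"
  # for a dry run that only prints the commands.
  local runner="$1" step
  for step in init_database run_original_sqls parse_query_logs generate_views \
              create_top_views run_rewrite_sqls parse_query_mv_logs \
              cost_estimation recommend get_recommend_mvs create_recommend_views; do
    ${runner} "${step}" || { echo "step ${step} failed" >&2; return 1; }
  done
}

# Dry run: print each command instead of executing it.
run_pipeline "echo python3 main.pyc"
```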