Tuning Process Flow

Effective performance tuning begins with identifying the root cause and pinpointing the layer where the bottleneck occurs, followed by selecting targeted optimization strategies.

The analysis roadmap is as follows:

Dimensions of bottleneck analysis

  • For server performance issues, focus on hardware resources such as CPU, memory, and disk I/O, as well as the BIOS configuration. Bottlenecks can be identified by analyzing the collected data.
  • For network problems, binding NIC interrupts to specific cores can significantly improve network performance.

Key points for Hive optimization

Hive performance is affected by the query plan and by Tez task execution. During optimization, analyze both the HiveServer run logs and the GC logs. The HiveServer log path is /var/log/hive on the HiveServer node. Table 1 describes the key logs.

Table 1 Key log files

File                                              | Description
--------------------------------------------------|--------------------
hiveserver2.log                                   | HiveServer run log
hiveserver2-gc-2019-12-30_22-27-02.log.0.current  | HiveServer GC log

Hive optimization procedure

  1. Analyze the tables and files to be processed by the SQL statements. Collect statistics on the number of files, data volume, number of records, file format, and compression format of the tables to be processed. Check whether there is a more suitable file storage format or compression format, whether partitions or buckets can be used, and whether a large number of small files need to be merged before executing the map task. If optimization is possible, proceed accordingly.
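As a sketch, such statistics can be gathered with Hive's built-in statistics statements (the table name `logs` follows the example later in this section):

```sql
-- Collect table-level statistics (row count, file count, raw data size).
ANALYZE TABLE logs COMPUTE STATISTICS;

-- Collect column-level statistics to improve query planning.
ANALYZE TABLE logs COMPUTE STATISTICS FOR COLUMNS;

-- Inspect the storage format, compression, and collected statistics.
DESCRIBE FORMATTED logs;
```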
    1. Determine a more appropriate file storage format.

      In the current TPC Benchmark DS (TPC-DS) test, the ORC file storage format is used, which provides the optimal performance. This storage format is recommended.
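As an illustrative sketch (the table name `logs_orc` is hypothetical), a table can be stored in ORC format by adding a STORED AS clause at creation time:

```sql
-- Store the table data in ORC format.
CREATE TABLE logs_orc(
timestamp BIGINT,
line STRING
)
STORED AS ORC;
```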

    2. Determine a more appropriate compression format.

      In the current TPC-DS test, the Snappy data compression format is used in the map/reduce phase. It currently provides the best test performance and is therefore recommended.
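As a sketch (the table name is hypothetical), Snappy can be applied to ORC table data through a table property, and to intermediate map output through session settings:

```sql
-- Compress ORC table data with Snappy.
CREATE TABLE logs_orc_snappy(
timestamp BIGINT,
line STRING
)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY");

-- Compress intermediate map output with Snappy.
SET hive.exec.compress.intermediate=true;
SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
```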

    3. Determine whether partitions or buckets can be used.

      Large data sets may take a long time to process. In many scenarios, partitioning or bucketing can reduce the total amount of data scanned by each query, which can significantly improve performance.

      Hive partitions are implemented using the Hadoop Distributed File System (HDFS) subdirectory feature: each subdirectory name contains a partition column name and its value. However, HDFS does not handle a large number of subdirectories well, which limits how finely a table can be partitioned. Estimate the number of partitions in a table in advance to avoid the problems caused by an excessively large number of partitions.

      Hive queries usually use partition columns as query conditions, so the MapReduce job can restrict its scan to the matching subdirectories in HDFS. Used this way, the HDFS directory structure is as efficient as an index.

      For example:

      CREATE TABLE logs(
      timestamp BIGINT,
      line STRING
      )
      PARTITIONED BY (date STRING, country STRING);
      

      The columns defined in the PARTITIONED BY clause are regular, queryable columns (partition columns) of the table, but their values are stored in the directory names rather than in the data files.
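For example, a query that filters on the partition columns scans only the matching subdirectories (the filter values shown are illustrative):

```sql
-- Only the subdirectories for this date and country are scanned.
SELECT timestamp, line
FROM logs
WHERE date = '2019-12-30' AND country = 'CN';
```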

      Hive can also organize tables or partitions into buckets. The purposes of organizing tables or partitions into buckets are as follows:

      The first purpose is to make sampling more efficient: when processing large data sets, it may not be practical to process all the data in the development and testing phases, so sampling is essential.

      The second purpose is to achieve better query processing efficiency.

      Buckets provide additional structures for tables. Hive uses these structures when processing certain queries, which improves query efficiency.

      Buckets are implemented by hashing the values of a specified column: rows are distributed into a set of buckets based on the hash value, and each bucket is stored as a separate file.

      Before inserting data into a bucketed table, set hive.enforce.bucketing to true so that Hive distributes the data into the correct number of buckets.

      The following is the statement for creating a table with buckets. Replace bucketed_user with the actual table name.

      CREATE TABLE bucketed_user(
      id INT,
      name STRING
      )
      CLUSTERED BY (id) INTO 4 BUCKETS;
      

      Data in a partition can be further split into buckets. Unlike a partition, which splits data by the literal value of a column, a bucket assigns rows by the hash value of a column, which also makes it suitable for data sampling.

      If the number of partitions is too large, the file system may be overwhelmed by the number of subdirectories. In this case, you are advised to use buckets instead.
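Once a table is bucketed, a query can sample a single bucket instead of reading the full table; a sketch using the bucketed_user table from the example above:

```sql
-- Read roughly a quarter of the data: one bucket out of four, hashed on id.
SELECT * FROM bucketed_user
TABLESAMPLE(BUCKET 1 OUT OF 4 ON id);
```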

    4. Determine whether a large number of small files need to be merged before executing the map task.

      Check the size of each file in each table. If there are a large number of small files (files smaller than the HDFS block size), you are advised to merge them before the map phase. The merge statement is as follows. Replace bucketed_user with the actual table name.

      ALTER TABLE bucketed_user CONCATENATE;
  2. Identify the performance bottleneck based on the data captured by Nmon, and adjust parameters such as hive.tez.container.size based on the observed CPU and memory usage.
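For instance, container memory can be adjusted per session; the values below are purely illustrative and must be derived from the CPU and memory usage observed in the Nmon data:

```sql
-- Illustrative values; choose them based on the captured Nmon data.
SET hive.tez.container.size=4096;    -- Tez container memory, in MB
SET tez.runtime.io.sort.mb=1024;     -- sort buffer size within the container, in MB
```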