Bulkload Test Case
Case Analysis
In the first half of the Bulkload test case, CPU usage on the 48-core Kunpeng computing platform exceeds 90%, and CPU usage on the x86 computing platform reaches 100%. The Bulkload case is analyzed as follows:
- Map phase: HFiles are generated in this phase. Depending on the data volume, tens of thousands of concurrent map tasks are launched to load the data to be imported from HDFS and convert its format. After the format conversion, the data is verified to check whether each key-value pair is valid, and the generated HFiles are compressed. This process consumes a large number of CPU resources. Currently, MapReduce requests one vCore for each map task.
- Reduce phase: The generated HFiles are written to different regions based on the number of regions. The concurrency of the reduce phase is determined by the number of regions.
Map Optimization
By default, the ImportTsv tool of Bulkload splits data files based on the HDFS block size (128 MB by default). For example, a 200 GB data file produces more than 1600 map tasks, and no parameter is provided to change the number of maps. Therefore, the ImportTsv source code is modified. The ImportTsv class is packaged in hbase-mapreduce-2.0.2.3.1.0.0-78.jar of the HBase source code. The specific path is as follows:
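As a rough model of why the map count comes out this way: for splittable text input, the number of map tasks is approximately the file size divided by the effective split size, which is the larger of the block size and the configured minimum split size. The following self-contained sketch illustrates this arithmetic (a simplification of FileInputFormat's actual split logic):

```java
// Simplified model of input split sizing (assumption: splittable text input,
// one map task per split, effective split size = max(blockSize, minSplitSize)).
public class SplitCount {
    static long mapTasks(long fileSizeBytes, long blockSizeBytes, long minSplitBytes) {
        long splitSize = Math.max(blockSizeBytes, minSplitBytes);
        return (fileSizeBytes + splitSize - 1) / splitSize; // ceiling division
    }

    public static void main(String[] args) {
        long MB = 1024L * 1024;
        long GB = 1024L * MB;
        // Default: 200 GB file with 128 MB blocks -> 1600 map tasks.
        System.out.println(mapTasks(200 * GB, 128 * MB, 0L));
        // With a 5 GB minimum split size (5368709120 bytes) -> 40 map tasks.
        System.out.println(mapTasks(200 * GB, 128 * MB, 5368709120L));
    }
}
```

Raising the minimum split size therefore shrinks the number of maps from the block-size default down toward a value you choose, such as the number of CPU cores.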

Add a configuration parameter to ImportTsv.java, that is, add a member variable.
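The member variable itself is not reproduced in this document; a minimal sketch, assuming the field simply holds the key of the new parameter (the field and class names here are illustrative, not the actual ImportTsv source), could be:

```java
// Sketch of the member variable added to ImportTsv.java (field name is an
// assumption; the key matches the -Dmapreduce.split.minsize option used below).
public class ImportTsvFieldSketch {
    final static String SPLIT_MINSIZE = "mapreduce.split.minsize";
}
```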

Add the following code to the createSubmittableJob method:
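The exact code added to createSubmittableJob is not shown here. A minimal sketch of what such an addition might look like, assuming the goal is to read the new parameter and apply it as the minimum input split size (the constant name, default value, and helper method are assumptions; compiling this requires the Hadoop MapReduce client libraries on the classpath):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Hypothetical sketch of the logic added inside ImportTsv.createSubmittableJob:
// read mapreduce.split.minsize from the job configuration and apply it as the
// minimum input split size, so fewer, larger splits (and fewer maps) are created.
public class CreateJobPatchSketch {
    static final String SPLIT_MINSIZE = "mapreduce.split.minsize"; // assumed name

    static void applyMinSplitSize(Job job) {
        Configuration conf = job.getConfiguration();
        // A default of 0 keeps the stock behavior (splits follow the block size).
        long minSize = conf.getLong(SPLIT_MINSIZE, 0L);
        if (minSize > 0) {
            FileInputFormat.setMinInputSplitSize(job, minSize);
        }
    }
}
```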

After the JAR package is recompiled, locate the original JAR package in the HBase installation and replace it with the recompiled one.

After the replacement, you can set the mapreduce.split.minsize parameter when running ImportTsv, as follows:
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,f1:H_NAME,f1:ADDRESS -Dimporttsv.separator="," -Dimporttsv.skip.bad.lines=true -Dmapreduce.split.minsize=5368709120 -Dimporttsv.bulk.output=/tmp/hbase/hfile ImportTable /tmp/hbase/datadirImport
-Dmapreduce.split.minsize=5368709120: This value (5 GB) is the minimum split size of the data file. By adjusting it, you can set the number of maps to the same value as the number of CPU cores.
Reduce Optimization
Check the task running time. It is found that the reduce tasks run for 3 minutes and 40 seconds or longer. Therefore, increase the number of reduce tasks so that the data processed by each reduce task is balanced.

You can change the number of reduce tasks by changing the number of regions (800 regions are set in this test).
To change the number of regions, change the value of split_num in the run_create.txt file.
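The contents of run_create.txt are not shown here. Assuming the table is pre-split on fixed-length, zero-padded numeric row keys (as in the record format described below), evenly spaced split keys for split_num regions could be generated along these lines (the key width and key range are assumptions for illustration):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: generate split_num - 1 boundary keys that divide a zero-padded
// numeric key space [0, maxKey) into split_num regions of roughly equal size.
public class RegionSplits {
    static List<String> splitKeys(long maxKey, int splitNum, int width) {
        List<String> keys = new ArrayList<>();
        for (int i = 1; i < splitNum; i++) {
            long boundary = maxKey / splitNum * i;
            keys.add(String.format("%0" + width + "d", boundary)); // zero-pad
        }
        return keys;
    }

    public static void main(String[] args) {
        // Example: 8 regions over 8-digit keys 00000000..99999999.
        System.out.println(splitKeys(100_000_000L, 8, 8));
        // -> [12500000, 25000000, 37500000, 50000000, 62500000, 75000000, 87500000]
    }
}
```

With split_num set to 800, the same scheme yields 799 boundary keys, so each reduce task handles one region's share of the data.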

Sequential Data Read
In this Bulkload test, the size of each record is 1 KB. The format of each record is as follows:

HBase row keys are sorted in ASCII dictionary (lexicographic) order. During sorting, the first bytes of two row keys are compared; if they are the same, the second bytes are compared, and so on, until the last bytes are compared. To ensure that data is written evenly to each region, pad the row keys so that they all have the same length.
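The effect of padding can be illustrated with plain string sorting, which matches HBase's byte-wise comparison for ASCII keys:

```java
import java.util.Arrays;

// Demonstrates why numeric row keys must be padded to a fixed length:
// without padding, lexicographic (byte-wise) order diverges from numeric order.
public class RowKeyPadding {
    static String pad(long id, int width) {
        return String.format("%0" + width + "d", id); // left-pad with zeros
    }

    public static void main(String[] args) {
        String[] unpadded = {"9", "10", "2"};
        Arrays.sort(unpadded); // lexicographic order
        System.out.println(Arrays.toString(unpadded)); // -> [10, 2, 9]

        String[] padded = {pad(9, 4), pad(10, 4), pad(2, 4)};
        Arrays.sort(padded); // now matches numeric order
        System.out.println(Arrays.toString(padded)); // -> [0002, 0009, 0010]
    }
}
```

Because every padded key has the same length, keys distribute across the pre-split regions in numeric order rather than clustering by their leading digits.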
