(Offline for 24.0.RC1) HDFS URL Fails to Be Parsed If the Partition Field in an ORC Community Release Contains Special Characters

Symptom

When an OmniOperator job runs against a Hive dataset in ORC or Parquet format, and the dataset has a partition field of the string, char, or varchar type whose value contains one or more special characters (among !#$%&()*+,-./:;<=>?@[\]^_`{|}~), the job fails with the following error logs and produces no execution result.

Error log 1: The file does not exist.

ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.util.TaskCompletionListenerException: delete nullptr error for reader

Previous exception in task: Can't open /user/hive/warehouse/tpcds_bin_partitioned_decimal_orc_2.db/partition_null_varchar/c_varchar=7893456=bbb/000000_0. status code: 2, message: File does not exist: /user/hive/warehouse/tpcds_bin_partitioned_decimal_orc_2.db/partition_null_varchar/c_varchar=7893456=bbb/000000_0

Error log 2: The HDFS URL format is invalid (ORC format).

ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.util.TaskCompletionListenerException: delete nullptr error for reader

Previous exception in task: Malformed URL: hdfs://server1:9000/user/hive/warehouse/tpcds_bin_partitioned_decimal_orc_2.db/partition_null_varchar/c_varchar=3333|bbb/000000_0

Error log 3: The HDFS URL format is invalid (Parquet format).

Previous exception in task: IOError: Invalid: Cannot parse URI: 'hdfs://server1:9000/user/hive/warehouse/mytest.db/partition_null_varchar/c_varchar=1233456|/000000_0'
/home/code/arrow/cpp/src/arrow/filesystem/filesystem.cc:750  ParseFileSystemUri(uri_string)
	com.huawei.boostkit.spark.jni.ParquetColumnarBatchJniReader.initializeReader(Native Method)
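The paths in the logs above show how a special character in a partition value ends up verbatim in the HDFS URL that the native reader must later parse. A minimal sketch (the helper function is hypothetical; the path layout follows Hive's column=value partition-directory convention seen in the error logs):

```python
# Sketch: a special character in a partition value surfaces unescaped in the
# HDFS URL. The helper and its arguments are illustrative, not a real API;
# the directory layout mirrors the error logs above.
WAREHOUSE = "hdfs://server1:9000/user/hive/warehouse"

def partition_file_url(db: str, table: str, column: str, value: str,
                       filename: str = "000000_0") -> str:
    """Build the URL of a data file inside a Hive-style partition directory."""
    return f"{WAREHOUSE}/{db}.db/{table}/{column}={value}/{filename}"

url = partition_file_url("mytest", "partition_null_varchar",
                         "c_varchar", "3333|bbb")
print(url)
# The '|' is embedded verbatim, which a strict RFC 3986 parser will reject.
```

This reproduces the exact URL shape seen in error log 2: any of the listed special characters in the partition value lands raw in the path component.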

Key Process and Cause Analysis

This is a bug in the third-party library uriparser, on which the open-source ORC and Arrow components depend.

  • The HDFS URL passed from Spark to native ORC is escaped in ORC's OrcHdfsFile.cc. When libhdfspp, the HDFS C++ client, then sends a request with the escaped path, the HDFS server reports that the file does not exist.
  • Native ORC uses the HDFS C++ client libhdfspp, which in turn uses the third-party dependency uriparser2 to parse the URL. If the URL contains special characters (such as |, {, }, [, ], <, >, '), parsing fails and an error is reported indicating that the HDFS URL format is invalid.
  • Similarly, Arrow uses uriparser to parse URLs, so native Parquet does not support these special characters either; parsing fails with the same kind of malformed-URL error.
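The root cause can be illustrated with any strict URI encoder. A minimal sketch using Python's standard urllib.parse (Python's own parser is lenient, so the sketch demonstrates the percent-encoding that a strict RFC 3986 parser such as uriparser requires, rather than the failure itself):

```python
from urllib.parse import quote, urlparse

# Characters from the analysis above that a strict RFC 3986 parser rejects
# when they appear raw in a URI path.
suspect = "|{}[]<>"

raw_path = ("/user/hive/warehouse/mytest.db/partition_null_varchar"
            "/c_varchar=3333|bbb/000000_0")

# Percent-encode everything except '/' and '=', which are legal in a path.
safe_path = quote(raw_path, safe="/=")
print(safe_path)  # the '|' becomes %7C

# After encoding, none of the suspect characters remain raw, so a strict
# parser can accept the resulting URL.
url = "hdfs://server1:9000" + safe_path
assert not any(c in safe_path for c in suspect)
print(urlparse(url).path)
```

This is exactly the escaping step the community patches add on the native side; without it, the raw partition value reaches uriparser/uriparser2 and parsing aborts.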

Conclusion and Solution

  • For the ORC format, use the liborc.so file provided in this version to avoid this problem, or recompile liborc.so with the patch provided by the community (https://github.com/apache/orc/pull/1609).
  • For the Parquet format, use the libarrow.so file provided in this version to avoid this problem, or recompile libarrow.so with the patch provided by the community (https://github.com/apache/arrow/pull/37490).