Why can't Impala read parquet files after Spark SQL's write?

Question

I'm having some issues with the way that Spark is interpreting columns for Parquet.

I have an Oracle source with confirmed schema (df.schema() method):

root
  |-- LM_PERSON_ID: decimal(15,0) (nullable = true)
  |-- LM_BIRTHDATE: timestamp (nullable = true)
  |-- LM_COMM_METHOD: string (nullable = true)
  |-- LM_SOURCE_IND: string (nullable = true)
  |-- DATASET_ID: decimal(38,0) (nullable = true)
  |-- RECORD_ID: decimal(38,0) (nullable = true)

Which is then saved as Parquet - df.write().parquet() method - with corresponding message type (determined by Spark):

  message spark_schema {
    optional int64 LM_PERSON_ID (DECIMAL(15,0));
    optional int96 LM_BIRTHDATE;
    optional binary LM_COMM_METHOD (UTF8);
    optional binary LM_SOURCE_IND (UTF8);
    optional fixed_len_byte_array(16) DATASET_ID (DECIMAL(38,0));
    optional fixed_len_byte_array(16) RECORD_ID (DECIMAL(38,0));
}
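
For context, here is a minimal sketch of the write path that produces a file with this message type; the JDBC connection details, table name and output path are placeholders, not taken from the original setup:

import org.apache.spark.sql.SparkSession

// Minimal sketch, assuming default Spark 2.x settings; connection details are hypothetical.
val spark = SparkSession.builder().appName("oracle-to-parquet").getOrCreate()

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//db-host:1521/SERVICE")  // hypothetical Oracle URL
  .option("dbtable", "ELM.LM_PERSON")                         // hypothetical source table
  .option("user", "elm_user")
  .option("password", "***")
  .load()

df.printSchema()  // shows the decimal(15,0) / decimal(38,0) columns listed above

// With the default (non-legacy) writer, decimal(15,0) becomes int64 and
// decimal(38,0) becomes fixed_len_byte_array(16), matching the message type above.
df.write.mode("overwrite").parquet("/dev/ELM/ELM_PS_LM_PERSON")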

My application then generates the table DDL using a HashMap for type conversion, for example:

CREATE EXTERNAL TABLE IF NOT EXISTS 
ELM_PS_LM_PERSON (
LM_PERSON_ID DECIMAL(15,0)
,LM_BIRTHDATE TIMESTAMP
,LM_COMM_METHOD STRING
,LM_SOURCE_IND STRING
,DATASET_ID DECIMAL(38,0)
,RECORD_ID DECIMAL(38,0)
) PARTITIONED BY (edi_business_day STRING) STORED AS PARQUET LOCATION '<PATH>'

My issue is that the table will fail to be read by Impala because it will not accept LM_PERSON_ID as a decimal field. The table will only read the parquet file if this column is set to BIGINT.

Query 8d437faf6323f0bb:b7ba295d028c8fbe: 0% Complete (0 out of 1)
File 'hdfs:dev/ELM/ELM_PS_LM_PERSON/part-00000-fcdbd3a5-9c93-490e-a124-c2a327a17a17.snappy.parquet' has an incompatible Parquet schema for column 'rbdshid1.elm_ps_lm_person_2.lm_person_id'. 
Column type: DOUBLE, Parquet schema:
optional int64 LM_PERSON_ID [i:0 d:1 r:0]

How do I know when to substitute a Decimal field for BIGINT?

The parquet message type is logged but not accessible?

Two decimal fields are converted to fixed_len_byte_array(16), while LM_PERSON_ID is converted to int64.

The only resolution I can think of is to create the table, test whether it can be read, and if not, drop it and substitute the decimal fields with BIGINT one by one, testing each time.

What am I missing here? Can I enforce a schema for the parquet file for decimal?

Recommended Answer

From Configuration section of Parquet Files in the official documentation of Apache Spark:

spark.sql.parquet.writeLegacyFormat (default: false)

If true, data will be written in a way of Spark 1.4 and earlier. For example, decimal values will be written in Apache Parquet's fixed-length byte array format, which other systems such as Apache Hive and Apache Impala use. If false, the newer format in Parquet will be used. For example, decimals will be written in int-based format. If Parquet output is intended for use with systems that do not support this newer format, set to true.
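
A minimal sketch of enabling that flag is shown below, assuming `df` is the DataFrame read from the Oracle source; the session-builder and runtime forms both set the same SQL conf:

import org.apache.spark.sql.SparkSession

// Sketch only: enable the legacy decimal encoding so Hive/Impala can read the files.
val spark = SparkSession.builder()
  .appName("parquet-legacy-write")
  .config("spark.sql.parquet.writeLegacyFormat", "true")
  .getOrCreate()

// or, on an already-running session:
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

// Decimals are now written as fixed_len_byte_array, which Impala accepts;
// `df` stands for the DataFrame read from the Oracle source (an assumption here).
df.write.mode("overwrite").parquet("/dev/ELM/ELM_PS_LM_PERSON")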

Answer Given Before Official Docs Got Updated

The very similar SPARK-20297 Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala was quite recently (20/Apr/17 01:59) resolved as Not A Problem.

The main point is to use the spark.sql.parquet.writeLegacyFormat property and write parquet metadata in a legacy format (which I don't see described in the official documentation under Configuration, and reported as an improvement in SPARK-20937).

Data written by Spark is readable by Hive and Impala when spark.sql.parquet.writeLegacyFormat is enabled.

It does follow the newer standard - https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal - and I missed the documentation. Wouldn't it then be a bug in Impala or Hive?

The int32/int64 options were present in the original version of the decimal spec; they just weren't widely implemented: https://github.com/Parquet/parquet-format/commit/b2836e591da8216cfca47075baee2c9a7b0b9289 . So it's not a new/old version thing, it was just an alternative representation that many systems didn't implement.
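
For reference, the Parquet decimal spec lets the physical encoding depend on precision, and Spark's standard (non-legacy) writer picks the smallest one that fits, which is why DECIMAL(15,0) came out as int64 while DECIMAL(38,0) came out as fixed_len_byte_array(16). A rough sketch of that rule follows; the helper name is hypothetical, not Spark's internal code:

// Rough sketch of the non-legacy precision-to-physical-type rule
// (hypothetical helper, not Spark's actual implementation):
def parquetPhysicalType(precision: Int): String =
  if (precision <= 9) "int32"                          // up to 9 digits fits in 4 bytes
  else if (precision <= 18) "int64"                    // e.g. DECIMAL(15,0) -> int64
  else {
    // smallest fixed_len_byte_array that can hold `precision` unscaled decimal digits
    val bytes = math.ceil((precision * math.log(10) / math.log(2) + 1) / 8).toInt
    s"fixed_len_byte_array($bytes)"                    // e.g. DECIMAL(38,0) -> 16 bytes
  }

// With spark.sql.parquet.writeLegacyFormat=true, all decimals are instead written
// as fixed_len_byte_array, which is what Hive and Impala expect.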

SPARK-10400 can also be quite helpful reading (about the history of the spark.sql.parquet.writeLegacyFormat property):

We introduced SQL option "spark.sql.parquet.followParquetFormatSpec" while working on implementing Parquet backwards-compatibility rules in SPARK-6777. It indicates whether we should use legacy Parquet format adopted by Spark 1.4 and prior versions or the standard format defined in parquet-format spec. However, the name of this option is somewhat confusing, because it's not super intuitive why we shouldn't follow the spec. Would be nice to rename it to "spark.sql.parquet.writeLegacyFormat" and invert its default value (they have opposite meanings). Note that this option is not "public" (isPublic is false).
