Why can't Impala read parquet files after Spark SQL's write?

Question

I'm having some issues with the way that Spark is interpreting columns for Parquet.

I have an Oracle source with confirmed schema (df.schema() method):

root
  |-- LM_PERSON_ID: decimal(15,0) (nullable = true)
  |-- LM_BIRTHDATE: timestamp (nullable = true)
  |-- LM_COMM_METHOD: string (nullable = true)
  |-- LM_SOURCE_IND: string (nullable = true)
  |-- DATASET_ID: decimal(38,0) (nullable = true)
  |-- RECORD_ID: decimal(38,0) (nullable = true)

Which is then saved as Parquet - df.write().parquet() method - with corresponding message type (determined by Spark):

  message spark_schema {
    optional int64 LM_PERSON_ID (DECIMAL(15,0));
    optional int96 LM_BIRTHDATE;
    optional binary LM_COMM_METHOD (UTF8);
    optional binary LM_SOURCE_IND (UTF8);
    optional fixed_len_byte_array(16) DATASET_ID (DECIMAL(38,0));
    optional fixed_len_byte_array(16) RECORD_ID (DECIMAL(38,0));
}
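
For context, here is a minimal sketch of the write path that produces a file with this message type; the JDBC connection details, table name and output path are placeholders, not taken from the original setup:

import org.apache.spark.sql.SparkSession

// Minimal sketch, assuming default Spark 2.x settings; connection details are hypothetical.
val spark = SparkSession.builder().appName("oracle-to-parquet").getOrCreate()

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//db-host:1521/SERVICE")  // hypothetical Oracle URL
  .option("dbtable", "ELM.LM_PERSON")                         // hypothetical source table
  .option("user", "elm_user")
  .option("password", "***")
  .load()

df.printSchema()  // shows the decimal(15,0) / decimal(38,0) columns listed above

// With the default (non-legacy) writer, decimal(15,0) becomes int64 and
// decimal(38,0) becomes fixed_len_byte_array(16), matching the message type above.
df.write.mode("overwrite").parquet("/dev/ELM/ELM_PS_LM_PERSON")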

My application then generates the table DDL using a HashMap for type conversion, for example:

CREATE EXTERNAL TABLE IF NOT EXISTS 
ELM_PS_LM_PERSON (
LM_PERSON_ID DECIMAL(15,0)
,LM_BIRTHDATE TIMESTAMP
,LM_COMM_METHOD STRING
,LM_SOURCE_IND STRING
,DATASET_ID DECIMAL(38,0)
,RECORD_ID DECIMAL(38,0)
) PARTITIONED BY (edi_business_day STRING) STORED AS PARQUET LOCATION '<PATH>'

My issue is that the table will fail to be read by Impala because it will not accept LM_PERSON_ID as a decimal field. The table will only read the parquet file if this column is set to BIGINT.

Query 8d437faf6323f0bb:b7ba295d028c8fbe: 0% Complete (0 out of 1)
File 'hdfs:dev/ELM/ELM_PS_LM_PERSON/part-00000-fcdbd3a5-9c93-490e-a124-c2a327a17a17.snappy.parquet' has an incompatible Parquet schema for column 'rbdshid1.elm_ps_lm_person_2.lm_person_id'. 
Column type: DOUBLE, Parquet schema:
optional int64 LM_PERSON_ID [i:0 d:1 r:0]

How do I know when to substitute a Decimal field for BIGINT?

The parquet message type is logged but not accessible?

Two decimal fields are converted to fixed_len_byte_array(16), while LM_PERSON_ID is converted to int64.

The only resolution I can think of is to create the table, test whether it can be read, and if not, drop it and substitute the decimal fields with BIGINT one by one, testing each time.

What am I missing here? Can I enforce a schema for the parquet file for decimal?

Recommended Answer

From Configuration section of Parquet Files in the official documentation of Apache Spark:

spark.sql.parquet.writeLegacyFormat (default: false)

If true, data will be written in a way of Spark 1.4 and earlier. For example, decimal values will be written in Apache Parquet's fixed-length byte array format, which other systems such as Apache Hive and Apache Impala use. If false, the newer format in Parquet will be used. For example, decimals will be written in int-based format. If Parquet output is intended for use with systems that do not support this newer format, set to true.
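
A minimal sketch of enabling that flag is shown below, assuming `df` is the DataFrame read from the Oracle source; the session-builder and runtime forms both set the same SQL conf:

import org.apache.spark.sql.SparkSession

// Sketch only: enable the legacy decimal encoding so Hive/Impala can read the files.
val spark = SparkSession.builder()
  .appName("parquet-legacy-write")
  .config("spark.sql.parquet.writeLegacyFormat", "true")
  .getOrCreate()

// or, on an already-running session:
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

// Decimals are now written as fixed_len_byte_array, which Impala accepts;
// `df` stands for the DataFrame read from the Oracle source (an assumption here).
df.write.mode("overwrite").parquet("/dev/ELM/ELM_PS_LM_PERSON")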

Answer Given Before Official Docs Got Updated

The very similar SPARK-20297 Parquet Decimal(12,2) written by Spark is unreadable by Hive and Impala was quite recently (20/Apr/17 01:59) resolved as Not A Problem.

The main point is to use the spark.sql.parquet.writeLegacyFormat property and write parquet metadata in a legacy format (which I don't see described in the official documentation under Configuration, and reported as an improvement in SPARK-20937).

Data written by Spark is readable by Hive and Impala when spark.sql.parquet.writeLegacyFormat is enabled.

It does follow the newer standard - https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal - and I missed the documentation. Wouldn't it then be a bug in Impala or Hive?

The int32/int64 options were present in the original version of the decimal spec; they just weren't widely implemented: https://github.com/Parquet/parquet-format/commit/b2836e591da8216cfca47075baee2c9a7b0b9289 . So it's not a new/old version thing, it was just an alternative representation that many systems didn't implement.
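
For reference, the Parquet decimal spec lets the physical encoding depend on precision, and Spark's standard (non-legacy) writer picks the smallest one that fits, which is why DECIMAL(15,0) came out as int64 while DECIMAL(38,0) came out as fixed_len_byte_array(16). A rough sketch of that rule follows; the helper name is hypothetical, not Spark's internal code:

// Rough sketch of the non-legacy precision-to-physical-type rule
// (hypothetical helper, not Spark's actual implementation):
def parquetPhysicalType(precision: Int): String =
  if (precision <= 9) "int32"                          // up to 9 digits fits in 4 bytes
  else if (precision <= 18) "int64"                    // e.g. DECIMAL(15,0) -> int64
  else {
    // smallest fixed_len_byte_array that can hold `precision` unscaled decimal digits
    val bytes = math.ceil((precision * math.log(10) / math.log(2) + 1) / 8).toInt
    s"fixed_len_byte_array($bytes)"                    // e.g. DECIMAL(38,0) -> 16 bytes
  }

// With spark.sql.parquet.writeLegacyFormat=true, all decimals are instead written
// as fixed_len_byte_array, which is what Hive and Impala expect.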

SPARK-10400 can also be quite helpful reading (about the history of the spark.sql.parquet.writeLegacyFormat property):

We introduced SQL option "spark.sql.parquet.followParquetFormatSpec" while working on implementing Parquet backwards-compatibility rules in SPARK-6777. It indicates whether we should use legacy Parquet format adopted by Spark 1.4 and prior versions or the standard format defined in parquet-format spec. However, the name of this option is somewhat confusing, because it's not super intuitive why we shouldn't follow the spec. Would be nice to rename it to "spark.sql.parquet.writeLegacyFormat" and invert its default value (they have opposite meanings). Note that this option is not "public" (isPublic is false).
