Schema evolution in parquet format

Problem description

Currently we are using the Avro data format in production. Among the several good points of using Avro, we know that it is good at schema evolution.

Now we are evaluating the Parquet format because of its efficiency when reading random columns. So before moving forward, our concern is still schema evolution.

Does anyone know whether schema evolution is possible in Parquet? If yes, how is it possible; if no, why not?

Some resources claim that it is possible, but that columns can only be added at the end.

What does this mean?

Solution

Schema evolution can be (very) expensive.

In order to figure out the schema, you basically have to read all of your parquet files and reconcile/merge their schemas at read time, which can be expensive depending on how many files and/or how many columns there are in the dataset.

Thus, since Spark 1.5, they switched off schema merging by default. You can always switch it back on.

Since schema merging is a relatively expensive operation, and is not a necessity in most cases, we turned it off by default starting from 1.5.0.
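
If you do want merging back, here is a minimal Spark SQL sketch (the path is hypothetical) that re-enables it for the current session and reads a directory of Parquet part-files directly:

-- Re-enable schema merging for this session (off by default since 1.5.0)
SET spark.sql.parquet.mergeSchema=true;

-- Query the files directly; Spark now reconciles the footers of all
-- part-files under this path into one merged schema
SELECT * FROM parquet.`/warehouse/datamart/unified_fact`;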

Without schema evolution, you can read the schema from one parquet file, and while reading the rest of the files assume that it stays the same.

Parquet schema evolution is implementation-dependent.

Hive, for example, has a knob, parquet.column.index.access=false, that you can set to map the schema by column name rather than by column index.

Then you could delete columns too, not just add them.

As I said above, it is implementation-dependent; for example, Impala would not read such parquet tables correctly (this was fixed in the recent Impala 2.6 release) [Reference].
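
To illustrate the Hive knob above, a minimal sketch (the table and column names are hypothetical):

-- Hive: resolve Parquet columns by name rather than by their position in the file
SET parquet.column.index.access=false;

-- Older files that still contain a dropped column are read fine: extra columns
-- in a file are ignored, and columns that a file lacks come back as NULL
SELECT user_id, registration_date
FROM datamart.unified_fact;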

Apache Spark, as of version 2.0.2, still seems to only support adding columns: [Reference]

Users can start with a simple schema, and gradually add more columns to the schema as needed. In this way, users may end up with multiple Parquet files with different but mutually compatible schemas. The Parquet data source is now able to automatically detect this case and merge schemas of all these files.
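
As a concrete sketch of the behaviour the quote describes (paths and layout are hypothetical), suppose an older job wrote files with (id, name) under one sub-directory and a newer job also writes registration_date under another:

-- /data/facts/version=1 holds files written with (id, name);
-- /data/facts/version=2 holds files written later with (id, name, registration_date)
SET spark.sql.parquet.mergeSchema=true;

-- Reading the parent directory merges both schemas (plus the discovered
-- partition column `version`); rows from the older files get NULL for registration_date
SELECT id, name, registration_date, version
FROM parquet.`/data/facts`;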

PS: What I have seen some folks do to have more agility on schema changes is to create a view on top of the actual parquet tables that maps two (or more) different but compatible schemas to one common schema.

Let's say you have added one new field (registration_date) and dropped another column (last_login_date) in your new release; then the view would look like this:

CREATE VIEW datamart.unified_fact_vw
AS
-- old schema: has last_login_date but not registration_date
SELECT f1..., NULL as registration_date
FROM datamart.unified_fact_schema1 f1
UNION ALL
-- new schema: has registration_date but not last_login_date
SELECT f2..., NULL as last_login_date
FROM datamart.unified_fact_schema2 f2
;

You get the idea. The nice thing is that it would work the same across all SQL-on-Hadoop dialects (like the ones I mentioned above: Hive, Impala and Spark), and you still have all the benefits of Parquet tables (columnar storage, predicate push-down, etc.).

P.P.S.: adding some information regarding the common_metadata summary files that Spark can create, to make this answer more complete.

Have a look at SPARK-15719

Parquet summary files are not particularly useful nowadays since

- when schema merging is disabled, we assume the schema of all Parquet part-files is identical, thus we can read the footer from any part-file.

- when schema merging is enabled, we need to read the footers of all files anyway to do the merge.

On the other hand, writing summary files can be expensive, because footers of all part-files must be read and merged. This is particularly costly when appending a small dataset to a large existing Parquet dataset.

So, some points against enabling common_metadata:

  • When a directory consists of Parquet files with a mixture of different schemas, _common_metadata allows readers to figure out a sensible schema for the whole directory without reading the schema of each individual file. Since Hive and Impala can access an SQL schema for said files from the Hive metastore, they can immediately start processing the individual files and match each of them against the SQL schema upon reading instead of exploring their common schema beforehand. This makes the common metadata feature unnecessary for Hive and Impala.

  • Even though Spark processes Parquet files without an SQL schema (unless using SparkSQL) and therefore in theory could benefit from _common_metadata, this feature was still deemed not to be useful and consequently got disabled by default in SPARK-15719.

  • Even if this feature were useful for querying, it is still a burden during writing. The metadata has to be maintained, which is not only slow, but also prone to race conditions and other concurrency issues, suffers from the lack of atomicity guarantees, and easily leads to data correctness issues due to stale or inconsistent metadata.

  • The feature is undocumented and seems to be considered deprecated (only "seems to be" because it never appears to have been officially supported in the first place, and an unsupported feature cannot be deprecated either).

  • From one of Cloudera engineers: "I don't know whether the behavior has changed on the read side to avoid looking at each footer if the common_metadata file is present. But regardless, writing that file in the first place is a HUGE bottleneck, and has caused a lot of problems for our customers. I'd really strongly recommend they don't bother with trying to generate that metadata file."

  • "_common_metadata" and "_metadata" files are Spark specific and are not written by Impala and Hive for example, and perhaps other engines.

Summary metadata files in Spark may still have their use cases, though - when there are none of the concurrency and other issues described above - for example, in some streaming use cases. I guess that's why this feature wasn't completely removed from Spark.
