Schema evolution in parquet format


Problem Description



Currently we are using the Avro data format in production. Among Avro's many good points, we know that it is good at schema evolution.

Now we are evaluating the Parquet format because of its efficiency when reading random columns. So before moving forward, our concern is schema evolution!

Does anyone know whether schema evolution is possible in Parquet, and if yes, how? If no, why not? Some presentations say it is possible, but that you can only add columns at the end.

What does this mean?

Thanks, ~Novice Developer

Solution

Schema evolution can be (very) expensive, because to figure out the schema you basically have to read all the Parquet files and reconcile/merge their schemas at read time, which can be costly depending on how many files and how many columns are in the dataset.

That's why in Spark 1.5 they switched schema merging off by default (it can be switched back on). http://spark.apache.org/docs/latest/sql-programming-guide.html :

Since schema merging is a relatively expensive operation, and is not a necessity in most cases, we turned it off by default starting from 1.5.0.
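
To turn merging back on, here is a minimal sketch assuming the Spark SQL shell (the /data/events path is hypothetical):

-- Re-enable schema merging for Parquet reads (off by default since Spark 1.5.0)
SET spark.sql.parquet.mergeSchema=true;

-- Query a directory of Parquet files directly; the merged schema spans all files
SELECT * FROM parquet.`/data/events`;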

Without schema evolution, you can read the schema from one Parquet file and assume it stays the same while reading the rest of the files.

Parquet schema evolution is implementation-dependent.

Hive, for example, has a knob,

parquet.column.index.access=false

which you can set to map the schema by column names rather than by column index. Then you can delete columns too, not just add them.
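
For example, a minimal Hive sketch (the table and column names are hypothetical):

-- Resolve Parquet columns by name rather than by ordinal position,
-- so dropping a column from the table definition does not misalign
-- the remaining columns in older files
SET parquet.column.index.access=false;

SELECT user_id, registration_date
FROM datamart.unified_fact;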

As I said, it is implementation-dependent; for example, Impala would not read such Parquet tables correctly (fixed in the recent Impala 2.6 release): http://community.cloudera.com/t5/Interactive-Short-cycle-SQL/external-table-stored-as-parquet-can-not-use-field-inside-a/m-p/36012

Spark, as of version 2.0.2, still seems to support only adding columns: http://spark.apache.org/docs/latest/sql-programming-guide.html#schema-merging

Users can start with a simple schema, and gradually add more columns to the schema as needed. In this way, users may end up with multiple Parquet files with different but mutually compatible schemas. The Parquet data source is now able to automatically detect this case and merge schemas of all these files.
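
The merge option can also be supplied per data source when the table is declared. A sketch, again assuming Spark SQL (the view name and path are hypothetical):

-- Merge the schemas of all Parquet files under the path for this source only
CREATE TEMPORARY VIEW events
USING parquet
OPTIONS (path "/data/events", mergeSchema "true");

SELECT * FROM events;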

PS. What I have seen some folks do to get more agility on schema changes is create a view on top of the actual Parquet tables that maps two (or more) different but compatible schemas to one common schema. Let's say you added a new field (registration_date) and dropped another column (last_login_date) in your new release; then this would look like:

CREATE VIEW datamart.unified_fact_vw
AS
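-- old-release table: has last_login_date but no registration_date yet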
SELECT f1..., NULL as registration_date 
FROM datamart.unified_fact_schema1 f1
UNION ALL
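-- new-release table: has registration_date but last_login_date was dropped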
SELECT f2..., NULL as last_login_date
FROM datamart.unified_fact_schema2 f2
;

You get the idea. The nice thing is that it would work the same across all the SQL-on-Hadoop dialects (like I mentioned above: Hive, Impala and Spark), while you still have all the benefits of Parquet tables (columnar storage, predicate push-down, etc.).
