Avro vs. Parquet


Problem description

I'm planning to use one of the Hadoop file formats for my Hadoop-related project. I understand that Parquet is efficient for column-based queries and Avro for full scans, or when we need all of the columns' data.

Before I proceed and choose one of the file formats, I want to understand the disadvantages/drawbacks of one over the other. Can anyone explain it to me in simple terms?

Answer

If you haven't already decided, I'd go ahead and write Avro schemas for your data. Once that's done, choosing between Avro container files and Parquet files is about as simple as swapping out, e.g.,

// Avro container output:
job.setOutputFormatClass(AvroKeyOutputFormat.class);
AvroJob.setOutputKeySchema(MyAvroType.getClassSchema());

// ...for Parquet output:
job.setOutputFormatClass(AvroParquetOutputFormat.class);
AvroParquetOutputFormat.setSchema(job, MyAvroType.getClassSchema());
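For reference, an Avro schema is just a JSON document. A minimal sketch of what a schema behind a type like `MyAvroType` might look like (the field names here are hypothetical, not taken from the answer) can be built with nothing but the standard library:

```python
import json

# Hypothetical record schema for illustration only; in practice schemas
# are usually kept in .avsc files and compiled into classes like MyAvroType.
my_avro_type = {
    "type": "record",
    "name": "MyAvroType",
    "namespace": "com.example",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "tags", "type": {"type": "array", "items": "string"}},
    ],
}

# Serialize to the JSON text form that Avro tooling consumes.
schema_json = json.dumps(my_avro_type, indent=2)
print(schema_json)
```

The same schema then drives both output paths above, which is what makes the swap between Avro containers and Parquet so cheap.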

The Parquet format does seem to be a bit more computationally intensive on the write side, e.g. requiring RAM for buffering and CPU for ordering the data, but it should reduce I/O, storage, and transfer costs, as well as make for efficient reads, especially with SQL-like queries (e.g., Hive or SparkSQL) that only address a portion of the columns.
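To see why a columnar layout helps queries that touch only some columns, here is a toy, stdlib-only sketch. It is not Parquet's actual encoding; it only contrasts how much data a single-column query walks in each layout:

```python
# Toy comparison of row-oriented vs column-oriented storage.
# Not Parquet's real format; purely illustrative.

rows = [{"id": i, "name": f"user{i}", "score": i * 1.5} for i in range(1000)]

# Row-oriented: all columns interleaved, so scanning one column still
# walks every field of every record.
row_store = [(r["id"], r["name"], r["score"]) for r in rows]

# Column-oriented: each column stored contiguously, so a query over
# "score" touches only that sequence.
column_store = {
    "id": [r["id"] for r in rows],
    "name": [r["name"] for r in rows],
    "score": [r["score"] for r in rows],
}

# SELECT avg(score): the columnar path reads 1000 values, while the
# row-oriented path reads 3000 fields to extract the same column.
avg_score = sum(column_store["score"]) / len(column_store["score"])
values_touched_columnar = len(column_store["score"])
values_touched_row = len(row_store) * 3
print(avg_score, values_touched_columnar, values_touched_row)
```

Real Parquet adds per-column encoding and compression on top of this, which is where the storage and transfer savings come from.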

In one project, I ended up reverting from Parquet to Avro containers because the schema was too extensive and nested (being derived from some fairly hierarchical object-oriented classes) and resulted in thousands of Parquet columns. In turn, our row groups were really wide and shallow, which meant that it took forever before we could process a small number of rows in the last column of each group.
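The "wide and shallow" problem above can be sketched with back-of-the-envelope arithmetic. All the numbers below are illustrative assumptions, not measurements from the project in the answer:

```python
# Parquet buffers rows into a row group up to a target byte size.
# With thousands of columns, the number of rows per group collapses,
# so each group holds very few rows of any one column.
# Illustrative numbers only.

ROW_GROUP_TARGET_BYTES = 128 * 1024 * 1024  # a commonly used row-group size
AVG_BYTES_PER_VALUE = 8                     # assumed average encoded value size

def rows_per_group(num_columns: int) -> int:
    """Rough rows that fit in one row group: each row costs one value per column."""
    return ROW_GROUP_TARGET_BYTES // (num_columns * AVG_BYTES_PER_VALUE)

narrow = rows_per_group(20)    # a sane, normalized schema
wide = rows_per_group(5000)    # thousands of columns, as in the anecdote
print(narrow, wide)
```

With the wide schema, each group is mostly metadata and column-chunk overhead relative to the handful of rows it holds, which matches the answer's experience of slow access to the last columns of each group.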

I haven't had much chance to use Parquet for more normalized/sane data yet, but I understand that, used well, it allows for significant performance improvements.

