Avro v/s Parquet


Question



I'm planning to use one of the Hadoop file formats for my Hadoop-related project. I understand Parquet is efficient for column-based queries and Avro for full scans or when we need all the column data!

Before I proceed and choose one of the file formats, I want to understand the disadvantages/drawbacks of one over the other. Can anyone explain it to me in simple terms?

Solution

If you haven't already decided, I'd go ahead and write Avro schemas for your data. Once that's done, choosing between Avro container files and Parquet files is about as simple as swapping out e.g.,
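For instance, a minimal Avro schema for the `MyAvroType` record used in the snippets below might look like the following (the field names and `com.example` namespace are made-up illustrations, not anything from the original project):

```json
{
  "type": "record",
  "name": "MyAvroType",
  "namespace": "com.example",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "score", "type": ["null", "double"], "default": null}
  ]
}
```

Saved as a `.avsc` file, the Avro Maven plugin or `avro-tools compile schema` can generate the `MyAvroType` class with its `getClassSchema()` method.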

job.setOutputFormatClass(AvroKeyOutputFormat.class);
AvroJob.setOutputKeySchema(MyAvroType.getClassSchema());

for

job.setOutputFormatClass(AvroParquetOutputFormat.class);
AvroParquetOutputFormat.setSchema(job, MyAvroType.getClassSchema());

The Parquet format does seem to be a bit more computationally intensive on the write side (e.g., requiring RAM for buffering and CPU for ordering the data), but it should reduce I/O, storage, and transfer costs, as well as make for efficient reads, especially with SQL-like queries (e.g., Hive or SparkSQL) that only address a portion of the columns.

In one project, I ended up reverting from Parquet to Avro containers because the schema was too extensive and nested (being derived from some fairly hierarchical object-oriented classes) and resulted in 1000s of Parquet columns. In turn, our row groups were really wide and shallow, which meant that it took forever before we could process a small number of rows in the last column of each group.
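A rough back-of-envelope sketch of that effect (the numbers are illustrative assumptions, not Parquet's actual sizing logic, which flushes row groups based on buffered memory):

```java
// Illustrative sketch: why 1000s of columns make Parquet row groups
// "wide and shallow". All sizes are hypothetical assumptions.
public class RowGroupSketch {
    // Approximate rows that fit in a fixed-size row group: each row
    // contributes roughly (columns * avgBytesPerValue) bytes, so the
    // row count per group shrinks as the column count grows.
    static long rowsPerGroup(long rowGroupBytes, int columns, long avgBytesPerValue) {
        return rowGroupBytes / ((long) columns * avgBytesPerValue);
    }

    public static void main(String[] args) {
        long groupBytes = 128L * 1024 * 1024; // ~128 MB row group

        // Narrow schema: 10 columns, ~8 bytes per value
        System.out.println(rowsPerGroup(groupBytes, 10, 8));    // ~1.7M rows per group

        // Wide schema: 2000 columns, ~8 bytes per value
        System.out.println(rowsPerGroup(groupBytes, 2000, 8));  // ~8.4K rows per group
    }
}
```

With the wide schema, each row group holds a few thousand rows at most, so a scan must page through every column chunk of the group to reach the later columns.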

I haven't had much chance to use Parquet for more normalized/sane data yet but I understand that if used well, it allows for significant performance improvements.

