Avro vs. Parquet


Question

I'm planning to use one of the Hadoop file formats for my Hadoop-related project. I understand that Parquet is efficient for column-based queries and Avro for full scans, or when we need all of the columns' data.

Before I proceed and choose one of the file formats, I want to understand the disadvantages/drawbacks of one over the other. Can anyone explain it to me in simple terms?

Answer

If you haven't already decided, I'd go ahead and write Avro schemas for your data. Once that's done, choosing between Avro container files and Parquet files is about as simple as swapping out, e.g.,

job.setOutputFormatClass(AvroKeyOutputFormat.class);
AvroJob.setOutputKeySchema(MyAvroType.getClassSchema());

for

job.setOutputFormatClass(AvroParquetOutputFormat.class);
AvroParquetOutputFormat.setSchema(job, MyAvroType.getClassSchema());
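The read side mirrors this swap. A minimal sketch, assuming the `avro-mapred` and `parquet-avro` modules are on the job classpath and `job` is the same Hadoop `Job` as above:

```java
// Reading Avro container files:
job.setInputFormatClass(AvroKeyInputFormat.class);
AvroJob.setInputKeySchema(job, MyAvroType.getClassSchema());

// Reading Parquet files through the Avro object model:
job.setInputFormatClass(AvroParquetInputFormat.class);
AvroParquetInputFormat.setAvroReadSchema(job, MyAvroType.getClassSchema());

// Optionally request only a subset of columns (projection push-down),
// where projectionSchema is an Avro schema covering just the needed fields:
AvroParquetInputFormat.setRequestedProjection(job, projectionSchema);
```

The projection call is where Parquet's column orientation pays off on the read path; with Avro containers the whole record is deserialized regardless.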

The Parquet format does seem to be a bit more computationally intensive on the write side, e.g., requiring RAM for buffering and CPU for ordering the data. But it should reduce I/O, storage, and transfer costs, as well as make for efficient reads, especially with SQL-like queries (e.g., Hive or SparkSQL) that only address a portion of the columns.
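To illustrate the read-side benefit, here is a hedged sketch of a column-pruned read using Spark's Java API (the path `data/events.parquet` and the column names are hypothetical; assumes `spark-sql` is on the classpath):

```java
SparkSession spark = SparkSession.builder()
        .appName("parquet-pruning-demo")
        .getOrCreate();

// With a Parquet source, Spark only reads the column chunks for the
// selected columns; the rest of the file's columns are skipped on disk.
Dataset<Row> events = spark.read().parquet("data/events.parquet");
events.select("userId", "eventTime").show();
```

With an Avro source, by contrast, every record is decoded in full before the projection is applied, so the same query reads the whole file.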

In one project, I ended up reverting from Parquet to Avro containers because the schema was too extensive and nested (being derived from some fairly hierarchical object-oriented classes) and resulted in thousands of Parquet columns. In turn, our row groups were really wide and shallow, which meant that it took forever before we could process the small number of rows in the last column of each group.
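Rough arithmetic shows why very wide schemas make row groups shallow. A sketch with hypothetical numbers (a 128 MB row-group target and 8 bytes per encoded value; real encodings and compression will vary):

```java
public class RowGroupMath {
    /** Approximate rows per row group: group size divided by bytes per row. */
    static long rowsPerGroup(long rowGroupBytes, int columns, int bytesPerValue) {
        return rowGroupBytes / ((long) columns * bytesPerValue);
    }

    public static void main(String[] args) {
        long groupBytes = 128L * 1024 * 1024; // 128 MB row-group target

        // A modest 50-column schema vs. a 5000-column schema derived
        // from deeply nested object-oriented classes:
        System.out.println(rowsPerGroup(groupBytes, 50, 8));
        System.out.println(rowsPerGroup(groupBytes, 5000, 8));
    }
}
```

At 5000 columns each row group holds only a few thousand rows (versus hundreds of thousands at 50 columns), so a scan that needs the last column still pays for traversing the whole wide group to reach it.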

I haven't had much chance to use Parquet for more normalized/sane data yet, but I understand that, used well, it allows for significant performance improvements.

