What are the pros and cons of parquet format compared to other formats?


Question

Characteristics of Apache Parquet are:

  • Self-describing (see the schema sketch after this list)
  • Columnar format
  • Language-independent
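
"Self-describing" means the schema travels inside the file itself (in its footer), so a reader needs no external metadata store to interpret the data. A minimal sketch using the pyarrow library; the file name `sales.parquet` and its columns are just placeholders:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Write a tiny table; Parquet embeds the schema in the file footer.
table = pa.table({"customer": ["a", "b"], "sales": [610.0, 420.0]})
pq.write_table(table, "sales.parquet")

# Read the schema back with no side information -- the file describes itself.
print(pq.read_schema("sales.parquet"))
# customer: string
# sales: double
```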

In comparison to Avro, Sequence Files, RC File etc., I want an overview of the formats. I have already read How Impala Works with Hadoop File Formats, which gives some insight into the formats, but I would like to know how access to data and storage of data is done in each of these formats. How does Parquet have an advantage over the others?

Answer

I think the main difference I can describe relates to record-oriented vs. column-oriented formats. Record-oriented formats are what we're all used to -- text files and delimited formats like CSV or TSV. Avro is slightly cooler than those because it can change schema over time, e.g. adding or removing columns from a record. Other tricks of various formats (especially including compression) involve whether a format can be split -- that is, can you read a block of records from anywhere in the dataset and still know its schema? But here's more detail on columnar formats like Parquet.

Parquet and other columnar formats handle a common Hadoop situation very efficiently. It is common to have tables (datasets) with many more columns than you would expect in a well-designed relational database -- a hundred or two hundred columns is not unusual. This is because we often use Hadoop as a place to denormalize data from relational formats -- yes, you get lots of repeated values, and many tables get flattened into a single one. But it becomes much easier to query since all the joins are already worked out. There are other advantages, such as retaining state-in-time data. So anyway, it's common to have a boatload of columns in a table.

Let's say there are 132 columns, some of them really long text fields, each different column following the other and using up maybe 10K per record.

While querying these tables is easy from a SQL standpoint, it's common that you'll want to get some range of records based on only a few of those hundred-plus columns. For example, you might want all of the records in February and March for customers with sales > $500.

To do this in a row format, the query would need to scan every record of the dataset: read the first row, parse the record into fields (columns), get the date and sales columns, and include the record in your result if it satisfies the condition. Repeat. If you have 10 years (120 months) of history, you're reading every single record just to find 2 of those months. Of course this is a great opportunity to partition on year and month, but even so, you're reading and parsing 10K of each record/row for those two months just to find out whether the customer's sales are > $500.
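
A rough sketch of that row-oriented scan, assuming the data sits in a CSV file; the file name and the column names `month` and `sales` are hypothetical:

```python
import csv

# Row format: every record is read and parsed in full, even though
# only 2 of the 132 fields matter to this query.
matches = []
with open("sales.csv", newline="") as f:
    for row in csv.DictReader(f):  # parses all 132 columns of every line
        if row["month"] in ("2", "3") and float(row["sales"]) > 500:
            matches.append(row)
```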

In a columnar format, each column (field) of a record is stored with others of its kind, spread over many different blocks on the disk -- the year columns together, the month columns together, the customer-employee-handbook columns (or other long text), and all the other columns that make those records so huge, each in its own separate place on the disk -- and of course the sales columns together. Well heck, dates and months are numbers, and so are sales -- they're just a few bytes. Wouldn't it be great if we only had to read a few bytes per record to determine which records matched our query? Columnar storage to the rescue!

Even without partitions, scanning the small fields needed to satisfy our query is super-fast -- they are all in order by record and all the same size, so the disk seeks over much less data while checking for included records. No need to read through that employee handbook and other long text fields -- just ignore them. So, by grouping columns together instead of rows, you can almost always scan less data. Win!
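
For contrast, here is the same hypothetical query against a Parquet copy of the data. With pyarrow, the filter only has to touch the `month` and `sales` column chunks, and it can skip whole row groups using the min/max statistics Parquet keeps per chunk:

```python
import pyarrow.parquet as pq

# Columnar format: only the column chunks needed by the filter are read;
# row groups whose statistics rule out a match are skipped entirely.
table = pq.read_table(
    "sales.parquet",
    filters=[("month", "in", {2, 3}), ("sales", ">", 500)],
)
```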

But wait, it gets better. If your query only needs those values and a few more (say, 10 of the 132 columns), and doesn't care about that employee-handbook column, then once it has picked the right records to return, it only has to go back to the 10 columns it needs to render the results, ignoring the other 122 of the 132 in our dataset. Again, we skip a lot of reading.
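
Projection works the same way: name the 10 columns you actually need (the names below are hypothetical) and the other 122 are never read off disk:

```python
import pyarrow.parquet as pq

# Only the listed column chunks are decoded; "employee_handbook" and
# the other 122 columns stay untouched on disk.
wanted = ["customer", "month", "sales"]  # ...plus the rest of the 10 needed
table = pq.read_table(
    "sales.parquet",
    columns=wanted,
    filters=[("month", "in", {2, 3}), ("sales", ">", 500)],
)
```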

(Note: for this reason, columnar formats are a lousy choice for straight transformations -- for example, if you're joining all of two tables into one big(ger) result set that you're saving as a new table, the sources are going to get scanned completely anyway, so there's not a lot of benefit in read performance; and because columnar formats need to remember more about where stuff is, they use more memory than a similar row format.)

One more benefit of columnar: the data is spread around. To get at a single record, you can have 132 workers each read (and write) data from/to 132 different places in 132 blocks of data. Yay for parallelization!
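
In Parquet, the concrete hook for that parallelism is the row group: a file is written as independent horizontal chunks, and separate workers can each process their own. A minimal sketch with pyarrow (sizes and column names are placeholders):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Write a file in row groups of ~100k rows; each is independently readable.
table = pa.table({"month": list(range(1_000_000)),
                  "sales": [1.0] * 1_000_000})
pq.write_table(table, "sales.parquet", row_group_size=100_000)

pf = pq.ParquetFile("sales.parquet")
print(pf.num_row_groups)                         # 10
# Worker i reads row group i -- no coordination with other workers needed.
chunk = pf.read_row_group(0, columns=["sales"])
print(chunk.num_rows)                            # 100000
```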

And now for the clincher: compression algorithms work much better when they can find repeating patterns. You could compress AABBBBBBCCCCCCCCCCCCCCCC as 2A6B16C, but ABCABCBCBCBCCCCCCCCCCCCCC wouldn't get as small (well, actually, in this case it would, but trust me :-) ). So once again, less reading. And writing too.
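
You can see the effect at a more realistic scale with any general-purpose compressor. A small experiment using Python's zlib, compressing the same 30,000 three-symbol values laid out column-wise (long runs, as a columnar file would store them) versus interleaved row-wise:

```python
import random
import zlib

random.seed(0)
values = [random.choice(b"ABC") for _ in range(30_000)]

columnar = bytes(sorted(values))  # AAAA...BBBB...CCCC... -- long runs
row_wise = bytes(values)          # symbols interleaved "per record"

print(len(zlib.compress(columnar)))  # a few dozen bytes: runs collapse
print(len(zlib.compress(row_wise)))  # several KB: little repetition to find
```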

So we read a lot less data to answer common queries, it's potentially faster to read and write in parallel, and compression tends to work much better.

Columnar is great when your input side is large and your output is a filtered subset: from big to little is great. It's not as beneficial when the input and output are about the same.

But in our case, Impala took our old Hive queries that ran in 5, 10, 20 or 30 minutes, and finished most in a few seconds or a minute.

Hope this helps answer at least part of your question!
