Parquet vs ORC vs ORC with Snappy
I am running a few tests on the storage formats available with Hive, with Parquet and ORC as the main options. I included ORC once with default compression and once with Snappy.
I have read many documents stating that Parquet is better than ORC in time/space complexity, but my tests show the opposite of what those documents claim.
Some details of my data follow.
Table A - Text file format - 2.5 GB
Table B - ORC - 652 MB
Table C - ORC with Snappy - 802 MB
Table D - Parquet - 1.9 GB
Parquet gave the worst compression for my table.
My tests on the above tables yielded the following results.
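To make the size comparison concrete, the reported table sizes can be turned into compression ratios relative to the text table. This is only a small arithmetic sketch: the sizes are the ones quoted above, and the conversion 1 GB = 1024 MB is an assumption about how they were measured.

```python
# Table sizes as reported in the question, in MB.
# Assumption: 1 GB = 1024 MB; the question does not say which unit convention was used.
sizes_mb = {
    "Text": 2.5 * 1024,
    "ORC": 652,
    "ORC with Snappy": 802,
    "Parquet": 1.9 * 1024,
}


def compression_ratios(sizes, baseline="Text"):
    """Return baseline_size / size for each format (higher means smaller files)."""
    base = sizes[baseline]
    return {name: base / size for name, size in sizes.items()}


ratios = compression_ratios(sizes_mb)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x smaller than the text table")
```

Under that assumption, both ORC tables come out more than three times smaller than the text table, while the Parquet table is only about 1.3 times smaller.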
Row count operation
Text Format Cumulative CPU - 123.33 sec
Parquet Format Cumulative CPU - 204.92 sec
ORC Format Cumulative CPU - 119.99 sec
ORC with SNAPPY Cumulative CPU - 107.05 sec
Sum of a column operation
Text Format Cumulative CPU - 127.85 sec
Parquet Format Cumulative CPU - 255.2 sec
ORC Format Cumulative CPU - 120.48 sec
ORC with SNAPPY Cumulative CPU - 98.27 sec
Average of a column operation
Text Format Cumulative CPU - 128.79 sec
Parquet Format Cumulative CPU - 211.73 sec
ORC Format Cumulative CPU - 165.5 sec
ORC with SNAPPY Cumulative CPU - 135.45 sec
Selecting 4 columns from a given range using where clause
Text Format Cumulative CPU - 72.48 sec
Parquet Format Cumulative CPU - 136.4 sec
ORC Format Cumulative CPU - 96.63 sec
ORC with SNAPPY Cumulative CPU - 82.05 sec
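The four result sets are easier to compare side by side when normalized. The sketch below uses only the numbers listed above, expresses each format's cumulative CPU time as a fraction of the Parquet time for the same query, and picks the lowest value per query. Note that cumulative CPU is CPU time summed over all tasks, so it is not the same thing as elapsed query time; the short query labels are made up for this example.

```python
# Cumulative CPU seconds as reported in the question.
# The query labels are shorthand invented for this sketch.
cpu_seconds = {
    "row count": {"Text": 123.33, "Parquet": 204.92, "ORC": 119.99, "ORC with Snappy": 107.05},
    "sum": {"Text": 127.85, "Parquet": 255.2, "ORC": 120.48, "ORC with Snappy": 98.27},
    "average": {"Text": 128.79, "Parquet": 211.73, "ORC": 165.5, "ORC with Snappy": 135.45},
    "filtered select": {"Text": 72.48, "Parquet": 136.4, "ORC": 96.63, "ORC with Snappy": 82.05},
}


def relative_to(results, baseline):
    """Divide every timing by the baseline format's timing for the same query."""
    return {
        query: {fmt: secs / times[baseline] for fmt, secs in times.items()}
        for query, times in results.items()
    }


def lowest_cpu(results):
    """Return the format with the lowest cumulative CPU time for each query."""
    return {query: min(times, key=times.get) for query, times in results.items()}


relative = relative_to(cpu_seconds, "Parquet")
winners = lowest_cpu(cpu_seconds)
for query, fmt in winners.items():
    print(f"{query}: lowest CPU is {fmt}; ORC uses {relative[query]['ORC']:.2f}x the Parquet CPU")
```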
Does that mean ORC is faster than Parquet? Or is there something I can do to get a better query response time and compression ratio?
Thanks!
I would say that both of these formats have their own advantages.
Parquet might be better if you have highly nested data, because it stores its elements as a tree, like Google Dremel does (see here).
Apache ORC might be better if your file structure is flat.
And as far as I know, Parquet does not support indexes yet. ORC comes with a lightweight index and, since Hive 0.14, an additional Bloom filter, which might help query response time, especially for queries that filter on specific values.
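Snappy is a common default for Parquet (Spark uses it, for example), but as far as I know Hive writes Parquet uncompressed unless parquet.compression is set. Are tables A, B, C and D holding the same dataset? If yes, something looks off when Parquet only compresses to 1.9 GB; the table may not be compressed at all.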
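One way to rule out differing compression settings is to rebuild the tables with the codec stated explicitly in TBLPROPERTIES. The sketch below only assembles the HiveQL text; it does not run anything against Hive. The table names and the `id` column are hypothetical placeholders, and the property names (`orc.compress`, `orc.bloom.filter.columns`, `parquet.compression`) are the Hive table properties as I understand them, so check them against your Hive version.

```python
def ctas_statement(target, source, file_format, properties):
    """Build a Hive CREATE TABLE ... AS SELECT statement with explicit TBLPROPERTIES."""
    props = ", ".join(f"'{key}'='{value}'" for key, value in properties.items())
    return (
        f"CREATE TABLE {target} STORED AS {file_format} "
        f"TBLPROPERTIES ({props}) AS SELECT * FROM {source}"
    )


# Hypothetical table and column names, used only for illustration.
statements = [
    ctas_statement(
        "table_c_orc_snappy",
        "table_a_text",
        "ORC",
        {"orc.compress": "SNAPPY", "orc.bloom.filter.columns": "id"},
    ),
    ctas_statement(
        "table_d_parquet_snappy",
        "table_a_text",
        "PARQUET",
        {"parquet.compression": "SNAPPY"},
    ),
]

for statement in statements:
    print(statement + ";")
```

If the Parquet table rebuilt this way is much smaller than 1.9 GB, the original table was most likely written without compression, and the CPU comparison is worth repeating on the rebuilt table.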