Parquet vs ORC vs ORC with Snappy


Problem description

I am running a few tests on the storage formats available with Hive and using Parquet and ORC as major options. I included ORC once with default compression and once with Snappy.

I have read many documents stating that Parquet is better than ORC in time/space complexity, but my tests show the opposite.

Some details of my data follow.

Table A - Text file format - 2.5 GB

Table B - ORC - 652 MB

Table C - ORC with Snappy - 802 MB

Table D - Parquet - 1.9 GB

Parquet gave the worst compression for my table.

My tests with the above tables yielded the following results.
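To make the size comparison concrete, here is a small Python sketch that turns the reported table sizes into compression ratios relative to the text table. It assumes 1 GB = 1024 MB and uses only the figures quoted above; the variable names are made up for illustration.

```python
# Table sizes reported in the question, converted to MB
# (assumption: 1 GB = 1024 MB).
sizes_mb = {
    "Text": 2.5 * 1024,
    "ORC (default compression)": 652,
    "ORC with Snappy": 802,
    "Parquet": 1.9 * 1024,
}

# Ratio of the text table's size to each format's size.
baseline = sizes_mb["Text"]
ratios = {name: baseline / size for name, size in sizes_mb.items()}

for name, ratio in ratios.items():
    print(f"{name:<26} {sizes_mb[name]:8.1f} MB  {ratio:4.2f}x smaller than text")
```

On these numbers ORC with default compression is roughly 3.9x smaller than text, ORC with Snappy roughly 3.2x, and Parquet only about 1.3x.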

Row count operation

Text Format Cumulative CPU - 123.33 sec

Parquet Format Cumulative CPU - 204.92 sec

ORC Format Cumulative CPU - 119.99 sec 

ORC with SNAPPY Cumulative CPU - 107.05 sec

Sum of a column operation

Text Format Cumulative CPU - 127.85 sec   

Parquet Format Cumulative CPU - 255.2 sec   

ORC Format Cumulative CPU - 120.48 sec   

ORC with SNAPPY Cumulative CPU - 98.27 sec

Average of a column operation

Text Format Cumulative CPU - 128.79 sec

Parquet Format Cumulative CPU - 211.73 sec    

ORC Format Cumulative CPU - 165.5 sec   

ORC with SNAPPY Cumulative CPU - 135.45 sec 

Selecting 4 columns from a given range using where clause

Text Format Cumulative CPU -  72.48 sec 

Parquet Format Cumulative CPU - 136.4 sec       

ORC Format Cumulative CPU - 96.63 sec 

ORC with SNAPPY Cumulative CPU - 82.05 sec 
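For easier comparison, the following Python sketch normalizes the cumulative CPU figures above against the text table and picks the cheapest format per query. It only rearranges the numbers from the post. Note that cumulative CPU is summed across tasks, so it is not the same thing as wall-clock response time.

```python
# Cumulative CPU seconds as reported in the question.
cpu_seconds = {
    "Row count": {
        "Text": 123.33, "Parquet": 204.92, "ORC": 119.99, "ORC with Snappy": 107.05,
    },
    "Sum of a column": {
        "Text": 127.85, "Parquet": 255.2, "ORC": 120.48, "ORC with Snappy": 98.27,
    },
    "Average of a column": {
        "Text": 128.79, "Parquet": 211.73, "ORC": 165.5, "ORC with Snappy": 135.45,
    },
    "Select 4 columns with WHERE": {
        "Text": 72.48, "Parquet": 136.4, "ORC": 96.63, "ORC with Snappy": 82.05,
    },
}


def relative_to_text(timings):
    """Express each format's CPU time as a multiple of the text table's."""
    base = timings["Text"]
    return {fmt: secs / base for fmt, secs in timings.items()}


relative = {query: relative_to_text(t) for query, t in cpu_seconds.items()}
fastest = {query: min(t, key=t.get) for query, t in cpu_seconds.items()}

for query, rel in relative.items():
    row = ", ".join(f"{fmt}: {value:.2f}x" for fmt, value in rel.items())
    print(f"{query}: {row} (lowest CPU: {fastest[query]})")
```

In these runs Parquet used between about 1.6x and 2.0x the CPU of the text table, ORC with Snappy was cheapest for the count and sum queries, and the text table was cheapest for the average and the filtered select.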

Does that mean ORC is faster than Parquet? Or is there something I can do to improve Parquet's query response time and compression ratio?

Thanks!

Solution

I would say that both of these formats have their own advantages.

Parquet might be better if you have highly nested data, because it stores its elements as a tree, as Google Dremel does.
Apache ORC might be better if your file structure is flattened.

As far as I know, Parquet does not support indexes yet. ORC comes with a lightweight index and, since Hive 0.14, an additional Bloom filter, which might help improve query response time, especially for sum operations.
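As a rough illustration of the ORC Bloom filter option, this Python sketch builds a Hive CTAS statement that sets the `orc.compress` and `orc.bloom.filter.columns` table properties. The table and column names (`table_b_bloom`, `table_a`, `customer_id`) are hypothetical, and whether a Bloom filter helps depends on the query predicates, so treat it as something to test rather than a guaranteed win.

```python
def orc_table_ddl(table, source, bloom_columns, compress="ZLIB"):
    """Build a Hive CTAS statement for an ORC table with Bloom filters.

    `table`, `source` and `bloom_columns` are placeholders supplied by
    the caller; nothing here is taken from the original question.
    """
    props = {
        "orc.compress": compress,
        "orc.bloom.filter.columns": ",".join(bloom_columns),
    }
    prop_sql = ", ".join(f'"{key}"="{value}"' for key, value in props.items())
    return (
        f"CREATE TABLE {table} STORED AS ORC "
        f"TBLPROPERTIES ({prop_sql}) "
        f"AS SELECT * FROM {source}"
    )


# Hypothetical usage: rebuild the ORC table with a Bloom filter on one column.
print(orc_table_ddl("table_b_bloom", "table_a", ["customer_id"]))
```

The generated statement can be pasted into the Hive CLI or Beeline; rerunning the same queries against the new table shows whether the Bloom filter changes the CPU figures.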

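Parquet is commonly written with SNAPPY compression, but the default depends on the writer: Hive, for instance, may write Parquet uncompressed unless `parquet.compression` is set. Do tables A, B, C and D hold the same dataset? If so, something looks off when the Parquet table only compresses to 1.9 GB.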
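One way to rule out an uncompressed Parquet table is to inspect its properties and rebuild it with the codec set explicitly. The Python sketch below only generates the Hive statements. The table names are hypothetical, and it assumes a Hive version that honors the `parquet.compression` table property.

```python
def parquet_snappy_statements(table, source, existing_table):
    """Return Hive statements to inspect and rebuild a Parquet table.

    All three table names are placeholders chosen by the caller.
    """
    inspect = f"SHOW TBLPROPERTIES {existing_table}"
    rebuild = (
        f"CREATE TABLE {table} STORED AS PARQUET "
        'TBLPROPERTIES ("parquet.compression"="SNAPPY") '
        f"AS SELECT * FROM {source}"
    )
    return [inspect, rebuild]


# Hypothetical usage: check table_d, then rebuild it from the text table.
for statement in parquet_snappy_statements("table_d_snappy", "table_a", "table_d"):
    print(statement + ";")
```

If the rebuilt table comes out much smaller than 1.9 GB, the original Parquet table was most likely written without compression, and the CPU comparison is worth repeating.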
