Parquet vs ORC vs ORC with Snappy

Views: 1956 | Published: 2018/5/31 19:03:36 | Tags: hadoop, hive, parquet, snappy, orc


Problem description

I am running a few tests on the storage formats available with Hive and using Parquet and ORC as major options. I included ORC once with default compression and once with Snappy.

I have read many documents stating that Parquet is better than ORC in time/space complexity, but my tests show the opposite of what those documents say.

Here are some details of my data.
Table A - Text File Format - 2.5 GB
Table B - ORC - 652 MB
Table C - ORC with Snappy - 802 MB
Table D - Parquet - 1.9 GB
Parquet was worst as far as compression for my table is concerned.
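
To make the size comparison concrete, here is a minimal sketch that expresses each table's size as a fraction of the text table. The sizes are the ones reported above; the 1 GB = 1024 MB conversion and the helper name `relative_size` are assumptions made for illustration only.

```python
# Table sizes reported in the question, converted to MB.
# Assumption: 1 GB = 1024 MB (the post does not say which convention applies).
MB_PER_GB = 1024

sizes_mb = {
    "Text": 2.5 * MB_PER_GB,
    "ORC": 652,
    "ORC with Snappy": 802,
    "Parquet": 1.9 * MB_PER_GB,
}


def relative_size(fmt, baseline="Text"):
    """Return the size of `fmt` as a fraction of the baseline table's size."""
    return sizes_mb[fmt] / sizes_mb[baseline]


for fmt in sizes_mb:
    print(f"{fmt:16s} {relative_size(fmt):6.1%} of the text table")
```

Under that assumption, the ORC table is roughly a quarter of the text table's size, while the Parquet table is still about three quarters of it.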

My tests with the above tables yielded the following results.

Row count operation
Text Format Cumulative CPU - 123.33 sec
Parquet Format Cumulative CPU - 204.92 sec
ORC Format Cumulative CPU - 119.99 sec
ORC with SNAPPY Cumulative CPU - 107.05 sec

Sum of a column operation
Text Format Cumulative CPU - 127.85 sec
Parquet Format Cumulative CPU - 255.2 sec
ORC Format Cumulative CPU - 120.48 sec
ORC with SNAPPY Cumulative CPU - 98.27 sec

Average of a column operation
Text Format Cumulative CPU - 128.79 sec
Parquet Format Cumulative CPU - 211.73 sec
ORC Format Cumulative CPU - 165.5 sec
ORC with SNAPPY Cumulative CPU - 135.45 sec

Selecting 4 columns from a given range using where clause
Text Format Cumulative CPU - 72.48 sec
Parquet Format Cumulative CPU - 136.4 sec
ORC Format Cumulative CPU - 96.63 sec
ORC with SNAPPY Cumulative CPU - 82.05 sec
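
The four result sets above can be compared side by side with a short script. This is only a sketch over the reported cumulative CPU figures (which are not the same thing as elapsed query time); the query labels and the helper names `fastest` and `parquet_vs_orc` are hypothetical and chosen for this example.

```python
# Cumulative CPU seconds copied from the results above.
cpu_seconds = {
    "row count": {
        "Text": 123.33, "Parquet": 204.92, "ORC": 119.99, "ORC with Snappy": 107.05,
    },
    "sum": {
        "Text": 127.85, "Parquet": 255.2, "ORC": 120.48, "ORC with Snappy": 98.27,
    },
    "average": {
        "Text": 128.79, "Parquet": 211.73, "ORC": 165.5, "ORC with Snappy": 135.45,
    },
    "select 4 columns": {
        "Text": 72.48, "Parquet": 136.4, "ORC": 96.63, "ORC with Snappy": 82.05,
    },
}


def fastest(query):
    """Return the format with the lowest cumulative CPU time for a query."""
    timings = cpu_seconds[query]
    return min(timings, key=timings.get)


def parquet_vs_orc(query):
    """Return Parquet CPU time divided by default-compression ORC CPU time."""
    timings = cpu_seconds[query]
    return timings["Parquet"] / timings["ORC"]


for query in cpu_seconds:
    print(
        f"{query:18s} fastest: {fastest(query):16s} "
        f"Parquet/ORC CPU ratio: {parquet_vs_orc(query):.2f}"
    )
```

On these numbers, ORC with Snappy uses the least CPU for the count and the sum, the text table uses the least for the average and the filtered select, and Parquet uses the most CPU in all four queries.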
Does that mean ORC is faster than Parquet? Or is there something I can do to improve the query response time and compression ratio?

Thanks!
Solution
I would say that both of these formats have their own advantages.

Parquet might be better if you have highly nested data, because it stores its elements as a tree, like Google Dremel does (see here).
Apache ORC might be better if your file structure is flattened.

And as far as I know, Parquet does not support indexes yet. ORC comes with a lightweight index and, since Hive 0.14, an additional Bloom filter, which might help improve query response time, especially when it comes to sum operations.

The Parquet default compression is SNAPPY. Do tables A, B, C, and D hold the same dataset? If so, it looks like there is something shady about it, since the Parquet table only compresses to 1.9 GB.
