Why is querying Parquet files slower than text files in Hive?

Views: 587 | Posted: 2020/11/22 2:14:57 | Tags: hadoop, hive, parquet, mapr, snappy

Problem description

I decided to use Parquet as the storage format for my Hive tables, and before rolling it out on my cluster I ran some tests. Surprisingly, Parquet was slower in my tests, contrary to the general notion that it is faster than plain text files.

Please note that I am using Hive 0.13 on MapR.

----------------------------------------------------------
|             | Table A | Table B | Table C |            |
----------------------------------------------------------
| Format      | Text    | Parquet | Parquet |            |
| Size [GB]   | 2.5     | 1.9     | 1.9     |            |
| Compression | N/A     | N/A     | Snappy  |            |
| CPU [sec]   | 123.33  | 204.92  | N/A     | Operation1 |
| Time [sec]  | 59.057  | 50.33   | N/A     | Operation1 |
| CPU [sec]   | 51.18   | 117.08  | N/A     | Operation2 |
| Time [sec]  | 25.296  | 27.448  | N/A     | Operation2 |
| CPU [sec]   | 57.55   | 113.97  | N/A     | Operation3 |
| Time [sec]  | 20.254  | 27.678  | N/A     | Operation3 |
| CPU [sec]   | 57.55   | 113.97  | N/A     | Operation4 |
| Time [sec]  | 20.254  | 27.678  | N/A     | Operation4 |
| CPU [sec]   | 127.85  | 255.2   | N/A     | Operation5 |
| Time [sec]  | 29.68   | 41.025  | N/A     | Operation5 |

Operation1: Row count operation
Operation2: Single-row selection
Operation3: Multi-row selection using a WHERE clause [1000 rows fetched]
Operation4: Multi-row selection [only 4 columns] using a WHERE clause [1000 rows fetched]
Operation5: Aggregation operation [using the sum function on a given column]
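For readers who want to reproduce the test, the five operations above could be expressed as HiveQL roughly as follows. This is only a sketch: the post does not give the actual queries, so the table name, the column names, and the filter values are all hypothetical placeholders.

```python
# Hypothetical names: the post does not state the real table or column names.
TABLE = "table_a"
KEY_COL = "id"
VALUE_COL = "amount"
FOUR_COLS = ["id", "col_a", "col_b", "amount"]


def build_queries(table=TABLE):
    """Return one HiveQL string per benchmark operation listed above."""
    cols = ", ".join(FOUR_COLS)
    # The filter values (42, 1..1000) are placeholders, not values from the post.
    return {
        "Operation1": f"SELECT COUNT(*) FROM {table}",
        "Operation2": f"SELECT * FROM {table} WHERE {KEY_COL} = 42",
        "Operation3": f"SELECT * FROM {table} WHERE {KEY_COL} BETWEEN 1 AND 1000",
        "Operation4": f"SELECT {cols} FROM {table} WHERE {KEY_COL} BETWEEN 1 AND 1000",
        "Operation5": f"SELECT SUM({VALUE_COL}) FROM {table}",
    }


if __name__ == "__main__":
    # Print each statement so it can be pasted into the Hive CLI.
    for name, sql in build_queries().items():
        print(f"{name}: {sql};")
```

Running the same statements against each table (passing a different name to `build_queries`) keeps the comparison like for like. Operations 4 and 5 touch only a few columns, which is the case where a columnar format would normally be expected to help most.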

As you can see, in almost all of the operations I ran on both tables, Parquet lags behind in query execution time, the row count operation being the only exception.
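To put numbers on that gap, the Parquet-to-Text ratios can be computed directly from the table. The sketch below only restates the figures reported above (Table C is left out because no measurements are given for it); the function name is made up for illustration.

```python
# (text, parquet) pairs copied from the table above, in seconds.
RESULTS = {
    "Operation1": {"cpu": (123.33, 204.92), "time": (59.057, 50.33)},
    "Operation2": {"cpu": (51.18, 117.08), "time": (25.296, 27.448)},
    "Operation3": {"cpu": (57.55, 113.97), "time": (20.254, 27.678)},
    "Operation4": {"cpu": (57.55, 113.97), "time": (20.254, 27.678)},
    "Operation5": {"cpu": (127.85, 255.2), "time": (29.68, 41.025)},
}


def parquet_to_text_ratio(metric):
    """Ratio > 1 means Parquet used more of `metric` ("cpu" or "time") than Text."""
    return {
        op: round(values[metric][1] / values[metric][0], 2)
        for op, values in RESULTS.items()
    }


if __name__ == "__main__":
    for metric in ("cpu", "time"):
        print(metric, parquet_to_text_ratio(metric))
```

On these figures Parquet uses roughly 1.7 to 2.3 times the CPU of the text table in every operation, while in elapsed time it is ahead only for the row count and behind for the rest. Note that the Operation3 and Operation4 rows in the table carry identical figures.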

I also ran the above operations on Table C, but the results were along similar lines, with the TextFile format again being the snappier of the two.

Can someone please let me know what I am doing wrong?

Thanks!

Edit

I added ORC to the list of storage formats and ran the tests again. The details follow.

Row count operation

Text Format Cumulative CPU - 123.33 sec

Parquet Format Cumulative CPU - 204.92 sec

ORC Format Cumulative CPU - 119.99 sec

ORC with SNAPPY Cumulative CPU - 107.05 sec

Sum of a column operation

Text Format Cumulative CPU - 127.85 sec

Parquet Format Cumulative CPU - 255.2 sec

ORC Format Cumulative CPU - 120.48 sec

ORC with SNAPPY Cumulative CPU - 98.27 sec

Average of a column operation

Text Format Cumulative CPU - 128.79 sec

Parquet Format Cumulative CPU - 211.73 sec

ORC Format Cumulative CPU - 165.5 sec

ORC with SNAPPY Cumulative CPU - 135.45 sec

Selecting 4 columns from a given range using a where clause

Text Format Cumulative CPU - 72.48 sec

Parquet Format Cumulative CPU - 136.4 sec

ORC Format Cumulative CPU - 96.63 sec

ORC with SNAPPY Cumulative CPU - 82.05 sec
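The four sets of figures above are easier to compare side by side. The sketch below only ranks the reported cumulative CPU times per operation; the dictionary keys and function names are made up for illustration, and nothing beyond the numbers in this post is assumed.

```python
# Cumulative CPU seconds copied from the results listed above.
CPU_SECONDS = {
    "row count": {"Text": 123.33, "Parquet": 204.92, "ORC": 119.99, "ORC+Snappy": 107.05},
    "sum": {"Text": 127.85, "Parquet": 255.2, "ORC": 120.48, "ORC+Snappy": 98.27},
    "average": {"Text": 128.79, "Parquet": 211.73, "ORC": 165.5, "ORC+Snappy": 135.45},
    "4-column select": {"Text": 72.48, "Parquet": 136.4, "ORC": 96.63, "ORC+Snappy": 82.05},
}


def rank_formats(cpu_by_format):
    """Return format names ordered from least to most CPU time."""
    return sorted(cpu_by_format, key=cpu_by_format.get)


def summarize(results=CPU_SECONDS):
    """Map each operation to its (cheapest, most expensive) format by CPU."""
    summary = {}
    for operation, cpu_by_format in results.items():
        ranking = rank_formats(cpu_by_format)
        summary[operation] = (ranking[0], ranking[-1])
    return summary


if __name__ == "__main__":
    for operation, (best, worst) in summarize().items():
        print(f"{operation}: lowest CPU = {best}, highest CPU = {worst}")
```

On these numbers, ORC with Snappy has the lowest CPU for the row count and the sum, plain text has the lowest for the average and the 4-column select, and Parquet has the highest in all four. These are cumulative CPU figures only, so they say nothing about elapsed time or compression ratio.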

Does that mean ORC is faster than Parquet? Or is there something I can do to make Parquet work better in terms of query response time and compression ratio?

Thanks!
