Why is querying Parquet files slower than text files in Hive?


Question

I decided to use Parquet as the storage format for my Hive tables, and before rolling it out on my cluster I ran some tests. Surprisingly, Parquet was slower in my tests, contrary to the general notion that it is faster than plain text files.

Please note that I am using Hive 0.13 on MapR.

----------------------------------------------------------
|             | Table A | Table B | Table C | Operation  |
----------------------------------------------------------
| Format      | Text    | Parquet | Parquet |            |
| Size [GB]   | 2.5     | 1.9     | 1.9     |            |
| Compression | N/A     | N/A     | Snappy  |            |
| CPU [sec]   | 123.33  | 204.92  | N/A     | Operation1 |
| Time [sec]  | 59.057  | 50.33   | N/A     | Operation1 |
| CPU [sec]   | 51.18   | 117.08  | N/A     | Operation2 |
| Time [sec]  | 25.296  | 27.448  | N/A     | Operation2 |
| CPU [sec]   | 57.55   | 113.97  | N/A     | Operation3 |
| Time [sec]  | 20.254  | 27.678  | N/A     | Operation3 |
| CPU [sec]   | 57.55   | 113.97  | N/A     | Operation4 |
| Time [sec]  | 20.254  | 27.678  | N/A     | Operation4 |
| CPU [sec]   | 127.85  | 255.2   | N/A     | Operation5 |
| Time [sec]  | 29.68   | 41.025  | N/A     | Operation5 |
----------------------------------------------------------

  • Operation1: Row count operation
  • Operation2: Single row selection
  • Operation3: Multi-row selection using a WHERE clause [1000 rows fetched]
  • Operation4: Multi-row selection [with only 4 columns] using a WHERE clause [1000 rows fetched]
  • Operation5: Aggregation operation [using the sum function on a given column]

As you can see, in almost all the operations I ran on both tables, Parquet lags behind in query execution time, with the exception of the row count operation.

I also ran the same operations on Table C, but the results were along similar lines: the TextFile format was again the snappier of the two.

Can someone please let me know what I am doing wrong?

Thanks!

Edit

I added ORC to the list of storage formats and ran the tests again. The details follow.

Row count operation

Text Format Cumulative CPU - 123.33 sec
Parquet Format Cumulative CPU - 204.92 sec
ORC Format Cumulative CPU - 119.99 sec
ORC with SNAPPY Cumulative CPU - 107.05 sec

Sum of a column operation

Text Format Cumulative CPU - 127.85 sec
Parquet Format Cumulative CPU - 255.2 sec
ORC Format Cumulative CPU - 120.48 sec
ORC with SNAPPY Cumulative CPU - 98.27 sec

Average of a column operation

Text Format Cumulative CPU - 128.79 sec
Parquet Format Cumulative CPU - 211.73 sec
ORC Format Cumulative CPU - 165.5 sec
ORC with SNAPPY Cumulative CPU - 135.45 sec

Selecting 4 columns from a given range using a WHERE clause

Text Format Cumulative CPU - 72.48 sec
Parquet Format Cumulative CPU - 136.4 sec
ORC Format Cumulative CPU - 96.63 sec
ORC with SNAPPY Cumulative CPU - 82.05 sec
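
To make the comparison easier to read, the cumulative CPU figures can be expressed as ratios against the text-format baseline. The following is only a small sketch using the numbers reported in this post; the dictionary keys and the function name `ratios_vs_text` are made up for illustration.

```python
# Cumulative CPU seconds copied from the figures reported in this post.
cpu_seconds = {
    "row count": {"text": 123.33, "parquet": 204.92, "orc": 119.99, "orc_snappy": 107.05},
    "sum": {"text": 127.85, "parquet": 255.2, "orc": 120.48, "orc_snappy": 98.27},
    "average": {"text": 128.79, "parquet": 211.73, "orc": 165.5, "orc_snappy": 135.45},
    "select 4 columns": {"text": 72.48, "parquet": 136.4, "orc": 96.63, "orc_snappy": 82.05},
}


def ratios_vs_text(measurements):
    """Return each format's CPU time divided by the text-format CPU time."""
    baseline = measurements["text"]
    return {fmt: round(value / baseline, 2) for fmt, value in measurements.items()}


# One ratio table per operation; the text format is always 1.0 by construction.
cpu_ratios = {op: ratios_vs_text(m) for op, m in cpu_seconds.items()}

for op, ratios in cpu_ratios.items():
    print(op, ratios)
```

A ratio above 1.0 means the format used more cumulative CPU than the text table for that operation, and a ratio below 1.0 means it used less.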

Does that mean ORC is faster than Parquet? Or is there something I can do to make Parquet perform better in terms of query response time and compression ratio?

Thanks!

Recommended answer

First, I would like to point out that it is virtually impossible to answer your question with the details given.

A few points:

  • Measuring time in a distributed environment is not a reliable way to determine whether something is slow (if many queries are running and competing for resources, you are not measuring what you think you are measuring).
  • Without the actual table definitions and the queries run against those tables, this problem is impossible to reproduce.
  • Not providing the number of rows in the table and the cardinality of its individual fields does not help either.
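
One common way to reduce the noise mentioned in the first point is to repeat each measurement and compare medians rather than single runs. The sketch below only illustrates that idea and is not something the original benchmark did; `time_repeated`, `summarize`, and `fake_query` are hypothetical names, and `fake_query` merely sleeps in place of submitting a real Hive query.

```python
import statistics
import time


def time_repeated(run_query, repetitions=5):
    """Call run_query several times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(repetitions):
        start = time.perf_counter()
        run_query()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Report the median and the spread, so one noisy run does not decide the comparison."""
    return {
        "median": statistics.median(durations),
        "min": min(durations),
        "max": max(durations),
    }


def fake_query():
    # Hypothetical stand-in for submitting a Hive query and waiting for the result.
    time.sleep(0.01)


print(summarize(time_repeated(fake_query, repetitions=3)))
```

Even with repeated runs, results from a shared cluster should be read with care, since other workloads can still skew every repetition in the same direction.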

In general, querying Parquet is much faster than querying text files because Parquet employs a number of techniques to speed up read operations. A few of them:

  • Compression
  • Run-length encoding
  • Dictionary encoding

Depending on the use case, some of the parameters of these techniques can be tuned to fit it.
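
To give a feel for what dictionary encoding and run-length encoding do to a column of values, here is a minimal sketch. It illustrates the general ideas only and is not Parquet's actual on-disk layout; the sample column and the function names are invented for the example.

```python
from itertools import groupby


def dictionary_encode(values):
    """Replace each value with a small integer index into a list of distinct values."""
    dictionary = []
    index_of = {}
    codes = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index_of[value])
    return dictionary, codes


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


# Hypothetical low-cardinality column with long runs of repeated values.
column = ["US", "US", "US", "DE", "DE", "US", "FR", "FR", "FR", "FR"]

dictionary, codes = dictionary_encode(column)
runs = run_length_encode(codes)

print(dictionary)
print(codes)
print(runs)
```

Both encodings pay off when a column has few distinct values or long runs of repeats, which is one reason the cardinality of each field matters when judging a benchmark like the one in the question.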
