How does Impala provide faster query response compared to Hive?

Question

I have recently started querying large sets of CSV data stored on HDFS using Hive and Impala. As I expected, Impala gives me better response times than Hive for the queries I have run so far.

I am wondering whether there are types of queries or use cases that still need Hive, and where Impala is not a good fit.

How does Impala provide faster query response than Hive for the same data on HDFS?

Recommended answer

You should see Impala as "SQL on HDFS", while Hive is more "SQL on Hadoop".

In other words, Impala doesn't use Hadoop's Map/Reduce engine at all. It simply has daemons running on all your nodes, which read data from HDFS directly and cache table metadata, so they can return results quickly without having to go through a whole Map/Reduce job.

The reason is that running a Map/Reduce job carries a certain fixed overhead, so by bypassing Map/Reduce altogether you can get a pretty big gain in runtime.
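To make the overhead argument concrete, here is a toy cost model in Python. It is only a sketch: the startup costs and scan rate are made-up figures, not measurements of Hive or Impala, and the function names are hypothetical. It treats each query as a fixed startup cost plus the time to scan the data.

```python
def total_runtime(startup_s: float, data_gb: float, scan_gb_per_s: float) -> float:
    """Fixed startup overhead plus the time needed to scan the data."""
    return startup_s + data_gb / scan_gb_per_s


# Hypothetical figures, chosen only to illustrate the shape of the trade-off.
BATCH_STARTUP_S = 20.0   # assumed cost of scheduling and launching a Map/Reduce job
DAEMON_STARTUP_S = 0.5   # assumed cost of planning a query on already-running daemons
SCAN_GB_PER_S = 1.0      # assumed identical scan rate for both approaches


def speedup(data_gb: float) -> float:
    """How many times faster the daemon-based path is under this toy model."""
    batch = total_runtime(BATCH_STARTUP_S, data_gb, SCAN_GB_PER_S)
    daemon = total_runtime(DAEMON_STARTUP_S, data_gb, SCAN_GB_PER_S)
    return batch / daemon


if __name__ == "__main__":
    for gb in (1, 10, 100, 1000):
        print(f"{gb:>5} GB: speedup {speedup(gb):.2f}x")
```

Under these assumptions the advantage is largest for small queries, where the fixed startup cost dominates, and shrinks toward 1x as scan time takes over. That matches the point about small ad-hoc queries benefiting most.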

That being said, Impala does not replace Hive; the two are good for very different use cases. Unlike Hive, Impala doesn't provide fault tolerance, so if something goes wrong while your query is running, the query is lost. For ETL-type jobs, where the failure of one job would be costly, I would definitely recommend Hive. Impala can be awesome for small ad-hoc queries, for example for data scientists or business analysts who just want to look at and analyze some data without building robust jobs. Also, from my personal experience, Impala is still not very mature, and I have seen occasional crashes when the amount of data is larger than available memory.
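Since a failed query is simply lost, any recovery has to happen on the client side. A minimal sketch in Python of what that could look like, assuming a hypothetical zero-argument callable `run_query` that submits the query and raises an exception on failure (it is not part of any Impala or Hive client library):

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(run_query: Callable[[], T],
                     attempts: int = 3,
                     backoff_s: float = 1.0) -> T:
    """Call run_query, rerunning it from scratch each time it raises."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return run_query()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last failure to the caller
            time.sleep(backoff_s * attempt)  # wait a little longer before each retry
    raise AssertionError("unreachable")
```

Every retry repeats the whole query, which is acceptable for a short ad-hoc query but expensive for a long ETL job. That is one way to see why the two tools suit different workloads.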
