How does Impala provide faster query response compared to Hive?


Question

I have recently started looking into querying large sets of CSV data stored on HDFS using Hive and Impala. As I expected, Impala gives me better response times than Hive for the queries I have tried so far.

I am wondering whether there are types of queries or use cases that still need Hive, and where Impala is not a good fit.

How does Impala provide faster query responses than Hive for the same data on HDFS?

Recommended answer

You should see Impala as "SQL on HDFS", while Hive is more "SQL on Hadoop".

In other words, Impala doesn't use Map/Reduce at all. It simply has long-running daemons on your nodes that read the data in HDFS directly and process it in memory, so they can return results quickly without having to go through a whole Map/Reduce job.

The reason this helps is that running a Map/Reduce job involves a certain overhead, so by bypassing Map/Reduce altogether you can get a pretty big gain in runtime.
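To make the overhead argument concrete, here is a minimal back-of-the-envelope sketch in Python. It is a toy cost model, not a benchmark: the function names, the 15-second per-stage startup cost, and the 0.5-second dispatch cost are all invented for illustration, and it deliberately assumes both engines need the same time to scan the data.

```python
# Toy cost model. Every number here is a made-up assumption, not a measurement.

def batch_job_seconds(scan_seconds, stages, startup_per_stage=15.0):
    """Engine that has to launch a fresh job for each stage (Map/Reduce style)."""
    return stages * startup_per_stage + scan_seconds


def daemon_query_seconds(scan_seconds, dispatch_overhead=0.5):
    """Engine whose worker daemons are already running (Impala style)."""
    return dispatch_overhead + scan_seconds


# Compare a tiny query, a medium one, and an hour-long scan.
for scan in (2.0, 60.0, 3600.0):
    batch = batch_job_seconds(scan, stages=2)
    daemon = daemon_query_seconds(scan)
    print(f"scan={scan:>7.1f}s  batch={batch:>7.1f}s  "
          f"daemon={daemon:>7.1f}s  speedup={batch / daemon:.1f}x")
```

With these invented numbers, the 2-second query comes out roughly 13 times faster without the job startup cost, while the hour-long scan is almost unchanged. That matches the intuition that skipping Map/Reduce pays off most for short, interactive queries.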

That being said, Impala does not replace Hive; the two are good for very different use cases. Unlike Hive, Impala doesn't provide fault tolerance, so if something goes wrong during your query, the query is gone and you have to run it again. For ETL-type jobs, where the failure of one job would be costly, I would definitely recommend Hive. Impala can be awesome for small ad-hoc queries, for example for data scientists or business analysts who just want to take a look at some data and analyze it without building robust jobs. Also, from my personal experience, Impala is still not very mature, and I have sometimes seen crashes when the amount of data is larger than the available memory.
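The fault-tolerance difference described above can be sketched with a small Python toy. This is not Hive or Impala code: `TaskFailed`, `make_flaky_task`, and both runner functions are hypothetical names made up for the illustration, and the retry limit of 3 is an arbitrary assumption.

```python
# Toy model of two failure-handling styles. Not real Hive or Impala code.

class TaskFailed(Exception):
    """Raised when a simulated unit of work fails."""


def make_flaky_task(value, failures):
    """Return a task that raises `failures` times, then returns `value`."""
    state = {"left": failures}

    def task():
        if state["left"] > 0:
            state["left"] -= 1
            raise TaskFailed(f"task {value} failed")
        return value

    return task


def run_with_task_retry(tasks, max_attempts=3):
    """Batch style: a failed task is retried and the job carries on."""
    results = []
    for task in tasks:
        for attempt in range(1, max_attempts + 1):
            try:
                results.append(task())
                break
            except TaskFailed:
                if attempt == max_attempts:
                    raise  # give up only after the last attempt
    return results


def run_all_or_nothing(tasks):
    """Interactive style: the first failure aborts the whole query."""
    return [task() for task in tasks]


# The second task fails once, then would succeed on a retry.
print(run_with_task_retry([make_flaky_task(1, 0), make_flaky_task(2, 1)]))

try:
    run_all_or_nothing([make_flaky_task(1, 0), make_flaky_task(2, 1)])
except TaskFailed as err:
    print("query lost:", err)
```

The retrying runner absorbs the single failure and still returns both results, which is the behaviour you want from a long ETL job. The all-or-nothing runner loses the whole query on the same failure, which is usually acceptable for a quick ad-hoc query that can simply be rerun.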
