Fast Hadoop Analytics (Cloudera Impala vs Spark/Shark vs Apache Drill)


Question

I want to do some "near real-time", OLAP-like analysis on data stored in HDFS.
My research shows that the three frameworks mentioned above report significant performance gains over Apache Hive. Does anyone have practical experience with any of them, not only regarding performance but also with respect to stability?

Recommended answer

Comparing Hive with Impala, Spark, or Drill sometimes seems inappropriate to me. The goals behind Hive and these tools were different. Hive was never meant for real-time, in-memory processing; it is based on MapReduce and was built for offline batch processing. It is best suited to long-running jobs that perform data-heavy operations, such as joins on very large datasets.

On the other hand, these tools were developed with real-time use in mind. Go for them when you need to query, in real time, data that is not very large and can fit into memory. I'm not saying you can't run queries on your big data with these tools, but you would be pushing the limits if you ran real-time queries on PBs of data, IMHO.
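
As a rough illustration of the "fits into memory" point, here is a back-of-the-envelope check in Python. It is only a sketch: the function name `fits_in_memory`, the cluster sizes, and the `usable_fraction` of 0.5 are hypothetical values chosen for the example, not figures from any of these projects.

```python
def fits_in_memory(working_set_gb, nodes, ram_per_node_gb, usable_fraction=0.5):
    """Return True if the queried working set fits in the cluster's usable memory.

    usable_fraction is an assumed share of RAM left for query processing
    after the OS and other services take theirs.
    """
    usable_gb = nodes * ram_per_node_gb * usable_fraction
    return working_set_gb <= usable_gb


# Hypothetical cluster: 20 nodes with 64 GB of RAM each.
print(fits_in_memory(500, nodes=20, ram_per_node_gb=64))        # 500 GB working set
print(fits_in_memory(1_000_000, nodes=20, ram_per_node_gb=64))  # roughly 1 PB
```

The number that matters is the size of the data a query actually touches, not the total size of the warehouse.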

Quite often you will have seen (or read) that a particular company has several PBs of data and is successfully catering to the real-time needs of its customers. But these companies are actually not querying their entire data most of the time. So the important thing is proper planning: when to use what. I hope you get the point I'm trying to make.

Coming back to your actual question: in my view it is hard to provide a reasonable comparison at this time, since most of these projects are far from complete. They are not production-ready yet, unless you are willing to do some (or maybe a lot of) work on your own. Also, each of these projects has goals that are very specific to that particular project.

For example, Impala was developed to take advantage of existing Hive infrastructure so that you don't have to start from scratch. It uses the same metadata that Hive uses. Its goal is to run real-time queries on top of your existing Hadoop warehouse. Drill, in contrast, was developed to be a "not only Hadoop" project that provides distributed query capabilities across multiple big data platforms, including MongoDB, Cassandra, Riak, and Splunk. Shark is compatible with Apache Hive, which means you can query it using the same HiveQL statements as you would through Hive. The difference is that Shark can return results up to 30 times faster than the same queries run on Hive.

Impala is doing well at present and some folks have been using it, but I'm not that confident about the other two. All these tools are good, but a fair comparison can be made only after you try them on your own data and for your own processing needs. In my experience, Impala would be the best bet at this moment. I'm not saying the other tools are not good, but they are not yet mature enough. Note that if you wish to use Impala with an already running Hadoop cluster (plain Apache Hadoop, for example), you might have to do some additional work, as almost everybody uses Impala as a CDH feature.

Note: All of this is based solely on my experience. If you find something wrong or inappropriate, please do let me know. Comments and suggestions are welcome. I hope this answers some of your queries.
