When to prefer Hadoop MapReduce over Spark?


Question


Very simple question: in which cases should I prefer Hadoop MapReduce over Spark? (I hope this question has not been asked yet - at least I didn't find it...)


I am currently comparing these two processing frameworks, and from what I have read so far, everybody seems to suggest using Spark. Does that also match your experience? Or can you name use cases where MapReduce performs better than Spark?


Would I need more resources (especially RAM) for the same task with Spark than I would with MapReduce?

Thanks and regards!

Answer


Spark is a great improvement over traditional MapReduce.


When would you use MapReduce over Spark?


When you have a legacy program written in the MapReduce paradigm that is so complex you do not want to reprogram it. Also, if your problem is not about analyzing data, Spark might not be right for you. One example I can think of is web crawling: there is a great Apache project called Apache Nutch, which is built on Hadoop, not Spark.
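To make "the MapReduce paradigm" concrete: a job is expressed as a map phase emitting key-value pairs, a shuffle that groups them by key, and a reduce phase aggregating each group. Here is a toy word count in that shape, written in plain Python purely as an illustration of the paradigm (this is not the Hadoop API; function names are mine):

```python
# Toy word count in the classic MapReduce shape: map -> shuffle -> reduce.
# Plain Python for illustration only -- not the actual Hadoop API.
from collections import defaultdict

def map_phase(lines):
    # Mapper: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["Spark and Hadoop", "Hadoop MapReduce"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["hadoop"])  # 2
```

Real legacy MapReduce programs chain many such stages with custom partitioners and job configuration, which is why rewriting them is often not worth it.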


When would I use Spark over MapReduce?


Ever since 2012... Ever since I started using Spark, I haven't wanted to go back. It has also been a great motivation to expand my knowledge beyond Java and to learn Scala. A lot of the operations in Spark take fewer characters to write. Also, using the Scala REPL makes it much easier to produce code quickly. Hadoop has Pig, but then you have to learn "Pig Latin", which will never be useful anywhere else...
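The conciseness point: in Spark's functional style, a whole map/shuffle/reduce pipeline collapses into one chained expression. As a stand-in (no cluster needed here), the same word count in plain functional Python is a couple of lines; Spark's RDD API reads similarly (textFile → flatMap → map → reduceByKey):

```python
# The same word count as one chained functional expression in plain Python.
# Spark's RDD chain (flatMap/map/reduceByKey) reads much the same way;
# this only illustrates the style, not the Spark API itself.
from collections import Counter

lines = ["Spark and Hadoop", "Hadoop MapReduce"]
counts = Counter(word.lower() for line in lines for word in line.split())
print(counts["hadoop"])  # 2
```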


If you want to use Python libraries in your data analysis, I find it easier to get Python working with Spark than with MapReduce. I also really like using something like IPython Notebook. Just as Spark motivated me to learn Scala when I started, using IPython Notebook with Spark motivated me to learn PySpark. It doesn't have all the functionality, but most of it can be made up for with Python packages.


Spark also now features Spark SQL, which is backward compatible with Hive. This lets you use Spark to run something close to SQL queries. I think this is much better than trying to learn HiveQL, which is different enough that everything becomes specific to it. With Spark SQL, you can usually get away with using general SQL advice to solve issues.
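The point is that generic SQL knowledge transfers: a plain aggregation query you would write anywhere also works in Spark SQL. As a rough stand-in (no Spark cluster here, and the table and column names are made up), the same kind of query runs unchanged in SQLite:

```python
# Illustration of "general SQL advice" working as-is: a standard
# GROUP BY aggregation, run in SQLite as a stand-in for Spark SQL.
# Table and data are invented for the example.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (framework TEXT, runtime_s REAL)")
conn.executemany("INSERT INTO jobs VALUES (?, ?)",
                 [("spark", 12.0), ("spark", 8.0), ("mapreduce", 40.0)])

# Average runtime per framework -- nothing engine-specific about it.
rows = conn.execute(
    "SELECT framework, AVG(runtime_s) FROM jobs "
    "GROUP BY framework ORDER BY framework").fetchall()
print(rows)  # [('mapreduce', 40.0), ('spark', 10.0)]
```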


Lastly, Spark also has MLlib for machine learning, which is a great improvement over Apache Mahout.


Largest Spark issue: the internet is not full of troubleshooting tips. Since Spark is new, the documentation on issues is a little lacking... It's a good idea to buddy up with someone from AMPLab/Databricks (the creators of Spark from UC Berkeley, and their consulting business) and use their forums for support.
