How is Apache Spark different from the Hadoop approach?


Problem Description



Everyone says that Spark is much faster than Hadoop because it keeps data in memory.

I couldn't tell from the Spark documentation what the real difference is:

  • Where does Spark store data in memory when Hadoop doesn't?
  • What happens if the data is too big to fit in memory? How similar would Spark be to Hadoop in that case?

Solution

Spark tries to keep things in memory, whereas MapReduce keeps shuffling things in and out of disk: Spark stores intermediate output in main memory, while Hadoop stores intermediate results in secondary storage (on disk). MapReduce inserts barriers between stages, and it takes a long time to write things to disk and read them back, so MapReduce can be slow and laborious. Eliminating this restriction makes Spark orders of magnitude faster. For SQL engines such as Hive, a chain of MapReduce operations is usually needed, and this requires a lot of I/O activity: on to disk, off of disk, on to disk, off of disk. When similar operations are run on Spark, it can keep things in memory without all that I/O, so you can keep operating on the same data quickly. This results in dramatic improvements in performance, and it means Spark definitely moves us into at least the interactive category.

For the record, there are some benefits to MapReduce doing all that recording to disk: writing everything out allows restarting after a failure. If you're running a multi-hour job, you don't want to begin again from scratch. For applications on Spark that run in seconds or minutes, a restart is obviously less of an issue.

As for data that is too big for memory: Spark does not simply fail. Cached partitions that don't fit are either recomputed from their lineage when needed or, with a storage level such as MEMORY_AND_DISK, spilled to local disk. In that regime Spark's access pattern starts to look more like Hadoop's, and the speed advantage shrinks accordingly.
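To make "keep things in memory" concrete, here is a minimal Scala sketch (the input path hdfs:///data/logs.txt and the log format are invented for illustration). The first action pays the disk read and caches the result; the second action is served from RAM. The MEMORY_AND_DISK persistence level shows the spill behavior described above:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object InMemoryDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("in-memory-demo").master("local[*]").getOrCreate()

    // Hypothetical input file; substitute a real path.
    val lines = spark.sparkContext.textFile("hdfs:///data/logs.txt")

    // cache() pins the filtered RDD in memory after the first action,
    // so later actions are served from RAM rather than re-reading disk.
    val errors = lines.filter(_.contains("ERROR")).cache()
    println(errors.count())                                // first pass: disk read, then cached
    println(errors.filter(_.contains("timeout")).count()) // second pass: from memory

    // For data that may not fit in RAM, MEMORY_AND_DISK spills
    // the overflowing partitions to local disk instead of failing.
    val all = lines.persist(StorageLevel.MEMORY_AND_DISK)
    println(all.count())

    spark.stop()
  }
}
```

Note that cache() is lazy: nothing is actually stored until the first action (count here) materializes the RDD.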

It's also easier to develop for Spark: it is much more powerful and expressive in how you instruct it to crunch data. Spark has Map and Reduce operations like MapReduce, but it adds others such as Filter, Join, and Group-by, so pipelines that would take several chained MapReduce jobs can be written as one short program (see the sketch below).
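As a small illustrative sketch (the customer and order data are invented), here is that operator vocabulary chained in one Spark pipeline:

```scala
import org.apache.spark.sql.SparkSession

object OperatorsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("operators-demo").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val orders = sc.parallelize(Seq((1, 25000), (2, 9000), (1, 4000))) // (customerId, cents)
    val names  = sc.parallelize(Seq((1, "Ada"), (2, "Lin")))           // (customerId, name)

    val totals = orders
      .map    { case (id, cents)  => (id, cents / 100.0) } // Map: cents -> dollars
      .filter { case (_, dollars) => dollars > 50.0 }      // Filter: drop small orders
      .reduceByKey(_ + _)                                  // Reduce: sum per customer
      .join(names)                                         // Join: attach customer names

    totals.collect().foreach(println) // (1,(250.0,Ada)) and (2,(90.0,Lin))

    spark.stop()
  }
}
```

Every intermediate RDD in that chain can stay in memory; the equivalent in classic MapReduce would be several jobs, each writing its output to HDFS for the next one to read back.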

Spark also adds libraries for machine learning (MLlib), streaming (Spark Streaming), graph processing (GraphX), and SQL (Spark SQL).
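For instance, a minimal Spark SQL snippet (toy data, invented column names) runs an aggregate query over an in-memory DataFrame with no chain of MapReduce jobs behind it:

```scala
import org.apache.spark.sql.SparkSession

object SqlDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sql-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    // Toy orders table, registered as a SQL view.
    val orders = Seq((1, 250.0), (2, 90.0), (1, 40.0)).toDF("customer_id", "amount")
    orders.createOrReplaceTempView("orders")

    // Spark SQL plans and executes this query in memory where possible.
    spark.sql("SELECT customer_id, SUM(amount) AS total FROM orders GROUP BY customer_id").show()

    spark.stop()
  }
}
```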
