Is caching the only advantage of Spark over map-reduce?


Question


I have started to learn about Apache Spark and am very impressed by the framework. One thing that keeps bothering me, though, is that in all Spark presentations they talk about how Spark caches RDDs, and therefore multiple operations that need the same data are faster than with other approaches like MapReduce.


So the question I had is: if this is the case, why not just add a caching engine inside MR frameworks like YARN/Hadoop?


Why create a new framework altogether?


I am sure I am missing something here, and you will be able to point me to some documentation that educates me more on Spark.

Answer


Caching plus in-memory computation is definitely a big thing for Spark; however, there are other things.
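To make the caching point concrete, here is a minimal sketch in plain Python (not the Spark API; the `expensive_transform` function and call counter are illustrative assumptions). Without materializing the intermediate result, every "action" over the same data repeats the expensive transformation; with it, the work is done once:

```python
# Illustrative sketch (plain Python, NOT the Spark API): counting how often
# an expensive transformation runs with and without caching its result.
compute_calls = 0

def expensive_transform(x):
    """Stand-in for a costly per-record computation; counts its invocations."""
    global compute_calls
    compute_calls += 1
    return x * x

data = range(1, 6)

# Two "actions" over the same derived data, no caching:
# the transform runs once per element, per action.
total = sum(expensive_transform(x) for x in data)
count_big = sum(1 for x in data if expensive_transform(x) > 10)
uncached_calls = compute_calls  # 2 actions x 5 elements = 10 calls

# Same two actions, but the intermediate result is materialized ("cached"),
# so the transform runs only once per element.
compute_calls = 0
cached = [expensive_transform(x) for x in data]
total = sum(cached)
count_big = sum(1 for v in cached if v > 10)
cached_calls = compute_calls  # 5 calls
```

In Spark the equivalent of the `cached` list would be calling `cache()`/`persist()` on an RDD that multiple actions reuse.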


RDD(Resilient Distributed Data set): an RDD is the main abstraction of spark. It allows recovery of failed nodes by re-computation of the DAG while also supporting a more similar recovery style to Hadoop by way of checkpointing, to reduce the dependencies of an RDD. Storing a spark job in a DAG allows for lazy computation of RDD's and can also allow spark's optimization engine to schedule the flow in ways that make a big difference in performance.


Spark API: Hadoop MapReduce has a very strict API that doesn't allow for as much versatility. Since Spark abstracts away many of the low-level details, it allows for more productivity. Also, things like broadcast variables and accumulators are much more versatile than DistributedCache and counters, IMO.


Spark Streaming: Spark Streaming is based on the paper Discretized Streams, which proposes a new model for doing windowed computations on streams using micro-batches. Hadoop doesn't support anything like this.
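The micro-batch model can be illustrated in plain Python (not the Spark Streaming API; `micro_batches` and the batch/window sizes are illustrative assumptions): the stream is cut into small batches, and a window aggregates over the most recent batches:

```python
# Illustrative sketch (plain Python, NOT the Spark Streaming API):
# cut a stream into micro-batches, then compute a sliding-window sum
# over the last 3 batches.
from collections import deque

def micro_batches(stream, batch_size):
    """Yield the stream in fixed-size micro-batches."""
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:                          # flush any final partial batch
        yield batch

window = deque(maxlen=3)               # window = 3 most recent micro-batches
window_sums = []
for batch in micro_batches(range(1, 13), batch_size=2):
    window.append(batch)               # oldest batch falls out automatically
    window_sums.append(sum(x for b in window for x in b))
```

Each windowed result is just a batch computation over a few small batches, which is why the same engine that runs batch jobs can run the streaming ones.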


As a product of in-memory computation, Spark sort of acts as its own flow scheduler, whereas with standard MR you need an external job scheduler like Azkaban or Oozie to schedule complex flows.


The Hadoop project is made up of MapReduce, YARN, Commons, and HDFS; Spark, however, is attempting to create one unified big-data platform with libraries (in the same repo) for machine learning, graph processing, streaming, and multiple SQL-type libraries, and I believe a deep-learning library is in the beginning stages. While none of this is strictly a feature of Spark, it is a product of Spark's computing model. Tachyon and BlinkDB are two other technologies built around Spark.
