Is caching the only advantage of Spark over MapReduce?

Problem description

I have started to learn about Apache Spark and am very impressed by the framework. One thing keeps bothering me, though: in all Spark presentations they talk about how Spark caches RDDs, so multiple operations that need the same data run faster than under other approaches like MapReduce.

So my question is: if that is the case, why not just add a caching engine to MR frameworks like Yarn/Hadoop?

Why create a new framework altogether?

I am sure I am missing something here, and you will be able to point me to some documentation that educates me more on Spark.

Recommended answer

Caching plus in-memory computation is definitely a big thing for Spark, but there are other things too.

RDD (Resilient Distributed Dataset): the RDD is the main abstraction of Spark. It allows recovery from failed nodes by re-computing the lineage DAG, while also supporting a recovery style more similar to Hadoop's by way of checkpointing, which cuts down an RDD's chain of dependencies. Storing a Spark job as a DAG allows lazy computation of RDDs and also lets Spark's optimization engine schedule the flow in ways that make a big difference to performance.
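
To make the lazy-evaluation and caching point concrete, here is a minimal Scala sketch against the Spark core API; the input path and the error-filtering logic are invented for illustration, not taken from the post:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddCachingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-caching-sketch"))

    // Transformations are lazy: nothing executes here, Spark only records
    // the lineage DAG.
    val logs   = sc.textFile("hdfs:///logs/app.log")       // hypothetical input
    val errors = logs.filter(_.contains("ERROR")).cache()  // mark for in-memory reuse

    // The first action materializes `errors` and caches its partitions; the
    // second reuses the cached data instead of re-reading from HDFS. If a node
    // holding cached partitions dies, Spark recomputes just those partitions
    // from the lineage.
    val total    = errors.count()
    val timeouts = errors.filter(_.contains("timeout")).count()

    println(s"errors=$total, timeouts=$timeouts")
    sc.stop()
  }
}
```

With plain MapReduce the second count would be a separate job re-reading the input from HDFS; here it is served from executor memory.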

Spark API: Hadoop MapReduce has a very strict API that doesn't allow for as much versatility. Since Spark abstracts away many of the low-level details, it allows for more productivity. Also, things like broadcast variables and accumulators are much more versatile than DistributedCache and counters, IMO.
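
A minimal sketch of broadcast variables and accumulators (written against the Spark 2.x core API; the lookup table, input codes, and accumulator name are made up):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastAccumulatorSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("broadcast-accumulator-sketch"))

    // Broadcast variable: ship a read-only lookup table to each executor once,
    // instead of serializing it into every task closure.
    val countryNames = sc.broadcast(Map("US" -> "United States", "DE" -> "Germany"))

    // Accumulator: tasks only add to it; the driver reads the final value.
    val unknownCodes = sc.longAccumulator("unknown-country-codes")

    val resolved = sc.parallelize(Seq("US", "DE", "XX")).map { code =>
      countryNames.value.get(code) match {
        case Some(name) => name
        case None       => unknownCodes.add(1); "unknown"
      }
    }

    resolved.collect().foreach(println)
    println(s"unknown codes seen: ${unknownCodes.value}")
    sc.stop()
  }
}
```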

Spark Streaming: Spark Streaming is based on the paper "Discretized Streams", which proposes a new model for doing windowed computations on streams using micro-batches. Hadoop doesn't support anything like this.
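
A minimal sketch of such a windowed micro-batch computation using the classic DStream API; the host, port, checkpoint path, and window sizes are arbitrary choices for illustration:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowedStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("windowed-stream-sketch")
    // The batch interval discretizes the stream into 2-second micro-batches,
    // each processed as a small RDD.
    val ssc = new StreamingContext(conf, Seconds(2))
    ssc.checkpoint("/tmp/stream-checkpoint") // required by the inverse-reduce window below

    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split("\\s+"))
    val windowedCounts = words
      .map(word => (word, 1))
      // Word counts over a sliding 30-second window, updated every 10 seconds;
      // the second function subtracts batches that slide out of the window.
      .reduceByKeyAndWindow(_ + _, _ - _, Seconds(30), Seconds(10))

    windowedCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```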

As a product of in-memory computation, Spark sort of acts as its own flow scheduler, whereas with standard MR you need an external job scheduler like Azkaban or Oozie to schedule complex flows.
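
A sketch of what "acting as its own flow scheduler" looks like in practice: several dependent steps chained inside one driver program, which Spark plans as stages of a single DAG. The paths and parsing logic are invented; with classic MR, each step would be a separate job wired together by Oozie or Azkaban.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object FlowInOneDriverSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("flow-in-one-driver-sketch"))

    // Step 1: parse raw events into (userId, 1) pairs.
    val events = sc.textFile("hdfs:///in/events")
      .map(_.split(","))
      .map(fields => (fields(0), 1))

    // Step 2: aggregate per user.
    val counts = events.reduceByKey(_ + _)

    // Step 3: join the aggregates against a second dataset and write out.
    val names = sc.textFile("hdfs:///in/users")
      .map(_.split(","))
      .map(fields => (fields(0), fields(1)))
    counts.join(names).saveAsTextFile("hdfs:///out/report")

    sc.stop()
  }
}
```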

The Hadoop project is made up of MapReduce, YARN, commons, and HDFS; Spark, however, is attempting to create one unified big data platform with libraries (in the same repo) for machine learning, graph processing, streaming, and multiple SQL-type workloads, and I believe a deep learning library is in the beginning stages. While none of this is strictly a feature of Spark, it is a product of Spark's computing model. Tachyon and BlinkDB are two other technologies built around Spark.
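
As a rough illustration of that unification, the sketch below mixes two of those libraries, Spark SQL and MLlib, in one application sharing a single runtime. It assumes the newer SparkSession/spark.ml API, and the data and column name are made up:

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object UnifiedPlatformSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("unified-platform-sketch").getOrCreate()
    import spark.implicits._

    // Spark SQL: register a DataFrame and query it with plain SQL.
    val points = Seq(
      Tuple1(Vectors.dense(0.0, 0.1)),
      Tuple1(Vectors.dense(0.1, 0.0)),
      Tuple1(Vectors.dense(9.0, 8.9)),
      Tuple1(Vectors.dense(8.9, 9.1))
    ).toDF("features")
    points.createOrReplaceTempView("points")
    spark.sql("SELECT COUNT(*) AS n FROM points").show()

    // MLlib: cluster the very same DataFrame, with no export/import step
    // between the SQL and machine-learning layers.
    val model = new KMeans().setK(2).setSeed(1L).fit(points)
    model.clusterCenters.foreach(println)

    spark.stop()
  }
}
```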
