Why does Spark save Map phase output to local disk?


Question

I'm trying to understand the Spark shuffle process in depth. When I started reading, I came across the following point:

"Spark writes the Map task (ShuffleMapTask) output directly to disk on completion."

I would like to understand the following with respect to Hadoop MapReduce:

  1. If both MapReduce and Spark write the data to the local disk, how is the Spark shuffle process different from Hadoop MapReduce's?

  2. Since data is represented as RDDs in Spark, why don't these outputs remain in the node executors' memory?

  3. How is the output of the Map tasks from Hadoop MapReduce and Spark different?

  4. If there are a lot of small intermediate files as output, how does Spark handle the network and I/O bottleneck?

Solution

First of all, Spark doesn't work in a strict map-reduce manner, and map output is not written to disk unless it is necessary. What gets written to disk are the shuffle files.
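
As a minimal sketch of where those shuffle files live: their location is controlled by the real `spark.local.dir` setting, and they sit on each worker's local disk rather than in a distributed file system like HDFS. The app name and scratch path below are assumptions for illustration, not defaults.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Shuffle files are written under spark.local.dir on each worker's
// local disk, not to a distributed file system such as HDFS.
val conf = new SparkConf()
  .setMaster("local[2]")                        // local mode with 2 threads (assumption)
  .setAppName("shuffle-files-demo")             // hypothetical app name
  .set("spark.local.dir", "/tmp/spark-scratch") // assumed scratch directory
val sc = new SparkContext(conf)
```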

That doesn't mean the data isn't kept in memory after the shuffle. Shuffle files in Spark are written mostly to avoid re-computation in case of multiple downstream actions (sketched after the list below). Why write to a file system at all? There are at least two interleaved reasons:

  • Memory is a valuable resource, and in-memory caching in Spark is ephemeral: old data can be evicted from the cache when the space is needed.
  • Shuffle is an expensive process that we want to avoid when it isn't necessary. It makes more sense to store shuffle data in a way that keeps it persistent for the lifetime of a given context.
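
A minimal sketch of the re-computation point, assuming the `sc` from the snippet above (or a Spark shell): the second action reuses the shuffle files produced for the first one, which shows up as skipped stages in the Spark UI.

```scala
// groupByKey forces a shuffle: map-side output is written to local disk.
val pairs   = sc.parallelize(1 to 100000).map(i => (i % 100, i))
val grouped = pairs.groupByKey()

grouped.count()                    // 1st action: runs the map stage and the shuffle
grouped.mapValues(_.size).count()  // 2nd action: map stage skipped, shuffle files reused
```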

The shuffle itself, apart from the ongoing low-level optimization efforts and implementation details, isn't different at all. It is based on the same basic approach, with all its limitations.

How are tasks different from Hadoop maps? As nicely illustrated by Justin Pihony, multiple transformations that don't require shuffles are squashed together into a single task. Since these operate on standard Scala Iterators, operations on individual elements can be pipelined, as the sketch below illustrates.
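
A small self-contained sketch of that pipelining (plain Scala, no Spark needed): chained `map` and `filter` calls on an `Iterator` are lazy, so each element flows through the whole chain one at a time, which is conceptually what a single Spark task does per partition.

```scala
// An Iterator is lazy: no intermediate collection is built between stages.
val partition: Iterator[Int] = (1 to 10).iterator

val piped = partition
  .map(_ * 2)         // nothing is computed yet...
  .filter(_ % 3 == 0) // ...elements are pulled through one by one on demand

println(piped.toList) // List(6, 12, 18)
```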

Regarding network and I/O bottlenecks, there is no silver bullet here. While Spark can reduce the amount of data that is written to disk or shuffled by combining transformations, caching in memory, and providing transformation-aware worker preferences (one such reduction is sketched below), it is subject to the same limitations as any other distributed framework.
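
One example of such a reduction, again assuming the `sc` from above: `reduceByKey` performs a map-side combine before the shuffle, so each mapper emits at most one partial sum per key instead of every raw pair, unlike `groupByKey`.

```scala
val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a")).map(w => (w, 1))

// Map-side combine: each partition pre-aggregates before shuffling,
// so far less data crosses the network than with groupByKey + sum.
val counts = words.reduceByKey(_ + _)

counts.persist()                // keep the shuffled result in executor memory
println(counts.collect().toMap) // Map(a -> 3, b -> 2, c -> 1)
```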
