Why is Spark Map phase output written to local disk?


Question



I'm trying to understand the Spark shuffle process in depth. While reading, I came across the following point:

Spark writes the Map task (ShuffleMapTask) output directly to disk on completion.

I would like to understand the following with respect to Hadoop MapReduce.

  1. If both MapReduce and Spark write their data to the local disk, how is the Spark shuffle process different from Hadoop MapReduce's?
  2. Since data is represented as RDDs in Spark, why aren't these outputs kept in the executors' memory on each node?
  3. How is the output of the Map tasks different between Hadoop MapReduce and Spark?
  4. If there are a lot of small intermediate files as output, how does Spark handle the network and I/O bottleneck?

Solution

First of all, Spark doesn't work in a strict map-reduce manner, and map output is not written to disk unless it is necessary. What gets written to disk are the shuffle files.

That doesn't mean data after the shuffle is not kept in memory. Shuffle files in Spark are written mostly to avoid re-computation in case of multiple downstream actions. Why write to a file system at all? There are at least two interleaved reasons:

  • memory is a valuable resource, and in-memory caching in Spark is ephemeral: old data can be evicted from the cache when needed.
  • shuffle is an expensive process that we want to avoid if not necessary; it makes more sense to store shuffle data in a way that keeps it persistent for the lifetime of a given context.
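The effect described above can be sketched without Spark at all: an expensive step runs once, its output lands on disk, and later "actions" re-read the file instead of recomputing. This is a minimal plain-Scala analogy, not Spark's actual shuffle machinery; the file name and the doubled-numbers "computation" are invented for illustration.

```scala
import java.nio.file.Files

// Plain-Scala sketch of why persisted shuffle files help: the expensive
// step executes once, and every later "action" re-reads its spilled output.
object ShuffleFileSketch {
  var expensiveRuns = 0 // counts how often the "shuffle" actually executes

  // Stand-in for a shuffle: expensive, so its result is spilled to disk
  // on first use and simply re-read afterwards.
  def shuffleOnce(path: java.nio.file.Path): Seq[Int] = {
    if (!Files.exists(path)) {
      expensiveRuns += 1
      val data = (1 to 5).map(_ * 2) // the "expensive" computation
      Files.write(path, data.mkString(",").getBytes)
    }
    new String(Files.readAllBytes(path)).split(",").map(_.toInt).toSeq
  }

  def main(args: Array[String]): Unit = {
    val spill = Files.createTempDirectory("shuffle").resolve("shuffle_0.data")
    val action1 = shuffleOnce(spill).sum // first downstream action
    val action2 = shuffleOnce(spill).max // second action reuses the file
    assert(action1 == 30 && action2 == 10 && expensiveRuns == 1)
    println(s"sum=$action1 max=$action2 shuffle ran $expensiveRuns time(s)")
  }
}
```

Two downstream "actions" consume the same data, but the expensive step runs exactly once, which is the trade-off the bullet points describe: pay disk I/O once to avoid recomputation later.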

The shuffle itself, apart from ongoing low-level optimization efforts and implementation details, isn't different at all. It is based on the same basic approach, with all its limitations.

How are tasks different from Hadoop maps? As nicely illustrated by Justin Pihony, multiple transformations which don't require shuffles are squashed together into a single task. Since these operate on standard Scala Iterators, operations on individual elements can be pipelined.
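The pipelining mentioned above can be demonstrated with a standard Scala Iterator alone: chained transformations are lazy, and each element flows through the whole chain one at a time, with no intermediate collection materialized between stages. This is a minimal sketch in plain Scala, not Spark code; the trace buffer exists only to make the interleaving visible.

```scala
// Sketch of per-element pipelining over a standard Scala Iterator,
// analogous to chained narrow transformations inside one Spark task.
object PipeliningSketch {
  def main(args: Array[String]): Unit = {
    val trace = scala.collection.mutable.ArrayBuffer.empty[String]

    val source: Iterator[Int] = Iterator(1, 2, 3)

    // Two chained transformations, analogous to rdd.map(...).filter(...)
    val pipeline = source
      .map { x => trace += s"map($x)"; x * 10 }
      .filter { x => trace += s"filter($x)"; x >= 20 }

    // Nothing has run yet: Iterator transformations are lazy.
    assert(trace.isEmpty)

    val result = pipeline.toList
    assert(result == List(20, 30))

    // The trace interleaves map and filter calls per element,
    // rather than running all maps first and all filters second.
    assert(trace.toList == List(
      "map(1)", "filter(10)",
      "map(2)", "filter(20)",
      "map(3)", "filter(30)"))
    println("pipelined result: " + result.mkString(", "))
  }
}
```

The interleaved trace is the point: because nothing forces an intermediate collection between `map` and `filter`, the two steps behave like one fused stage, which is exactly why shuffle-free transformations can be squashed into a single task.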

Regarding network and I/O bottlenecks, there is no silver bullet here. While Spark can reduce the amount of data that is written to disk or shuffled by combining transformations, caching in memory, and providing transformation-aware worker preferences, it is subject to the same limitations as any other distributed framework.
