Spark: disk I/O on stage boundaries explanation

Question

Only in some articles on Spark tuning have I seen this mentioned, for example:

At each stage boundary, data is written to disk by tasks in the parent stages and then fetched over the network by tasks in the child stage. Because they incur heavy disk and network I/O, stage boundaries can be expensive and should be avoided when possible.

Is this persistence to disk at every stage boundary always applied, for both HashJoin and SortMergeJoin? Why does Spark (an in-memory engine) do that persistence to tmp files before a shuffle? Is it done for task-level recovery, or for something else?
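
A quick way to see these stage boundaries from the Spark SQL side is to compare the physical plans of a shuffle-based join and a broadcast join. Below is a minimal spark-shell sketch (the column name, row counts and the threshold override are arbitrary choices of mine, not anything from the question); the shuffle shows up as an Exchange node in the plan:

```scala
// Paste into spark-shell; sizes and column name are arbitrary.
import org.apache.spark.sql.functions.broadcast

// Keep the planner from broadcasting automatically, so the first join
// really is a shuffle-based SortMergeJoin.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

val left  = spark.range(0, 1000000).withColumnRenamed("id", "k")
val right = spark.range(0, 1000000).withColumnRenamed("id", "k")

// SortMergeJoin: both inputs go through "Exchange hashpartitioning" - a
// shuffle, i.e. a stage boundary where map output is written to local disk.
left.join(right, "k").explain()

// Broadcast hash join: the hinted side is shipped to every executor instead,
// so there is no shuffle Exchange (and no extra stage boundary) for the join.
left.join(broadcast(right), "k").explain()
```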

P.S. The question relates mainly to the Spark SQL API, but I'm also interested in Streaming & Structured Streaming.

UPD: I found a mention of why it happens, and more details, in the book "Stream Processing with Apache Spark". Look for the "Task Failure Recovery" and "Stage Failure Recovery" topics on the referenced page. As far as I understand, Why = recovery, When = always, since this is the mechanics of Spark Core and the Shuffle Service, which is responsible for data transfer. Moreover, all of Spark's APIs (SQL, Streaming & Structured Streaming) are based on the same failover guarantees (of Spark Core/RDDs). So I suppose that this is common behaviour for Spark in general.
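
For reference, the shuffle files mentioned above are governed by a couple of standard configuration properties. A minimal standalone sketch, assuming a scratch directory of my own choosing ("/tmp/spark-scratch" and the app name are placeholders); the external shuffle service line is left commented out because it only makes sense on a real cluster:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object ShuffleConfigSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shuffle-config-sketch")
      .master("local[*]")
      // Root directory for block-manager / shuffle files; defaults to the
      // system tmp dir when not set.
      .config("spark.local.dir", "/tmp/spark-scratch")
      // On a real cluster the external shuffle service keeps serving map
      // output even if the executor that wrote it goes away. Commented out
      // here because plain local mode has no such service to register with.
      // .config("spark.shuffle.service.enabled", "true")
      .getOrCreate()

    // Any shuffle (here: a groupBy) writes its intermediate files under the
    // directory configured above until the application stops.
    spark.range(0, 1000000).groupBy(col("id") % 10).count().show()

    spark.stop()
  }
}
```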

Answer

It's a good question, in that we hear of in-memory Spark vs. Hadoop, so it is a little confusing. The docs are terrible, but I ran a few things and verified the observations by looking around to find a most excellent source: http://hydronitrogen.com/apache-spark-shuffles-explained-in-depth.html

Assuming an Action has been called - so as to avoid the obvious comment if this is not stated - and assuming we are not talking about a ResultStage and a broadcast join, then we are talking about a ShuffleMapStage. We look at an RDD initially.
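
On the RDD side, the stage split is easy to see from the lineage. A small spark-shell sketch (the word list is made up), where reduceByKey introduces the shuffle dependency and therefore the ShuffleMapStage / ResultStage split:

```scala
// In spark-shell: the lineage shows where reduceByKey splits the job into a
// ShuffleMapStage (the map side) and a ResultStage (the reduce side).
val words  = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

// The indentation step at the ShuffledRDD marks the stage boundary.
println(counts.toDebugString)

// Running an action materializes both stages, and the intermediate shuffle
// files end up on the executors' local disks in between.
counts.collect().foreach(println)
```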

Then, borrowing from that URL:

  • A DAG dependency involving a shuffle means a separate Stage is created.
  • Map operations are followed by Reduce operations, then Map again, and so on.

Current Stage

  • All the (fused) Map operations are performed intra-Stage.
  • The next Stage requirement, a Reduce operation - e.g. a reduceByKey, means the output is hashed or sorted by key (K) at the end of the Map operations of the current Stage (a rough sketch of that key bucketing follows after this list).
  • This grouped data is written to disk on the Worker where the Executor is - or storage tied to that Cloud version. (I would have thought in memory was possible, if data is small, but this is an architectural Spark approach as stated from the docs.)
  • The ShuffleManager is notified that hashed, mapped data is available for consumption by the next Stage. ShuffleManager keeps track of all keys/locations once all of the map side work is done.
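
To make the "hashed by key" point concrete, here is a simplified, conceptual sketch of the bucketing decision - roughly what Spark's HashPartitioner computes, not actual Spark source code, and the sample records are invented:

```scala
// Conceptual sketch only - NOT Spark source code.
def bucketFor(key: Any, numPartitions: Int): Int = {
  val raw = key.hashCode % numPartitions
  if (raw < 0) raw + numPartitions else raw // keep the bucket non-negative
}

// With 4 reduce partitions, each map task groups its output by bucket; the
// per-bucket output is what gets written to disk and later fetched by the
// reducer responsible for that bucket.
val records = Seq("spark" -> 1, "disk" -> 1, "shuffle" -> 1, "spark" -> 1)
val buckets = records.groupBy { case (k, _) => bucketFor(k, 4) }
buckets.foreach { case (b, recs) => println(s"bucket $b -> $recs") }
```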

Next Stage

  • The next Stage, being a reduce, then gets the data from those locations by consulting the Shuffle Manager and using the Block Manager (a local-mode way to peek at those on-disk shuffle files is sketched after this list).
  • The Executor may be re-used, or be a new one on another Worker, or another Executor on the same Worker.
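
If you want to convince yourself that this data really does sit on disk between the two stages, here is a rough local-mode sketch. It assumes the default local-mode layout (shuffle_*.data / shuffle_*.index files under blockmgr-* directories in spark.local.dir, falling back to /tmp), which is what I typically observe rather than anything documented:

```scala
// Rough spark-shell sketch: walk the local dir and list shuffle files left
// behind by the map stage. The "shuffle_*" naming and blockmgr-* directory
// layout are assumptions from typical local-mode runs, not a documented API.
import java.io.File

def shuffleFiles(dir: File): Seq[File] = {
  val children = Option(dir.listFiles()).map(_.toSeq).getOrElse(Seq.empty)
  children.filter(_.getName.startsWith("shuffle_")) ++
    children.filter(_.isDirectory).flatMap(shuffleFiles)
}

// Falls back to /tmp, which is where local mode puts its blockmgr-* dirs
// when spark.local.dir is not set.
val localDir = sc.getConf.get("spark.local.dir", "/tmp")
shuffleFiles(new File(localDir)).foreach(println) // shuffle_<id>_<map>_0.data / .index
```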

So, my understanding is that, architecturally, Stages mean writing to disk, even if there is enough memory. Given the finite resources of a Worker, it makes sense that writing to disk occurs for this type of operation. The more important point is, of course, the 'Map Reduce' implementation. I summarized the excellent posting; that is your canonical source.

Of course, fault tolerance is aided by this persistence: less re-computation work is needed.

Similar aspects apply to DFs.
