Spark: disk I/O on stage boundaries explanation
Question
At each stage boundary, data is written to disk by tasks in the parent stages and then fetched over the network by tasks in the child stage. Because they incur heavy disk and network I/O, stage boundaries can be expensive and should be avoided when possible.

Does this persistence to disk at each stage boundary always apply to both HashJoin and SortMergeJoin? Why does Spark, an in-memory engine, persist tmp files like this before a shuffle? Is it done for task-level recovery, or for something else?

P.S. The question relates mainly to the Spark SQL API, though I'm also interested in Streaming & Structured Streaming.

UPD: I found a mention, and more details on the Why, in the "Stream Processing with Apache Spark" book; look for the "Task Failure Recovery" and "Stage Failure Recovery" topics on the referenced page. As far as I understand, Why = recovery, and When = always, since this is the mechanics of Spark Core and the Shuffle Service, which is responsible for the data transfer. Moreover, all of Spark's APIs (SQL, Streaming, and Structured Streaming) are built on the same failover guarantees (those of Spark Core/RDDs), so I suppose this is common behaviour for Spark in general.
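For concreteness, here is a minimal, self-contained sketch (our illustration, not part of the original question; assumes local mode) that makes these stage boundaries visible for the two join strategies mentioned above. `explain()` prints the physical plan, and every `Exchange hashpartitioning` node is a shuffle, i.e. a stage boundary whose map output is written to disk:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object JoinShuffleDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("join-shuffle-demo")
      .master("local[*]")
      // Disable auto-broadcast so the planner picks a shuffle-based join below.
      .config("spark.sql.autoBroadcastJoinThreshold", "-1")
      .getOrCreate()
    import spark.implicits._

    val left  = (1 to 100000).map(i => (i, s"L$i")).toDF("id", "l")
    val right = (1 to 100000).map(i => (i, s"R$i")).toDF("id", "r")

    // SortMergeJoin: both inputs are shuffled (two Exchange nodes), so both
    // parent stages persist their map output before the join stage reads it.
    left.join(right, "id").explain()

    // BroadcastHashJoin: the hint removes the Exchange, hence no shuffle
    // write for the probe side (the small side travels via the driver).
    left.join(broadcast(right), "id").explain()

    spark.stop()
  }
}
```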
Answer

It's a good question in that we hear of in-memory Spark vs. Hadoop, so it's a little confusing. The docs are terrible, but I ran a few things and verified the observations by looking around to find a most excellent source: http://hydronitrogen.com/apache-spark-shuffles-explained-in-depth.html

Assuming an Action has been called (so as to avoid the obvious comment if this is not stated), and assuming we are not talking about a ResultStage and a broadcast join, then we are talking about a ShuffleMapStage. We look at an RDD initially.

Then, borrowing from the URL:

Current Stage: the (fused) map operations are performed intra-Stage; at its end the output is grouped or sorted by key and written to disk on the Worker where the Executor runs.

Next Stage: Spark runs the reduce operation of the next Stage by reading from those files.
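A quick way to see that boundary on a plain RDD (a sketch we added, assuming local mode): `reduceByKey` introduces a ShuffledRDD, and the indentation step in `toDebugString` marks where the ShuffleMapStage ends and the child stage begins.

```scala
import org.apache.spark.sql.SparkSession

object StageBoundaryDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("stage-boundary-demo")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))

    // map is fused into the current stage; reduceByKey forces a shuffle,
    // i.e. a new stage that reads the previous stage's disk output.
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

    // The indentation change in toDebugString marks the shuffle boundary:
    // the ShuffledRDD at the top belongs to the child stage.
    println(counts.toDebugString)

    counts.collect() // the Action that actually runs both stages
    spark.stop()
  }
}
```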
So my understanding is that, architecturally, Stages mean writing to disk, even if there is enough memory. Given the finite resources of a Worker, it makes sense that writing to disk occurs for this type of operation. The more important point is, of course, the 'Map Reduce' implementation. I summarized the excellent posting; that is your canonical source.

Of course, fault tolerance is aided by this persistence: less re-computation work.

Similar aspects apply to DataFrames.
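To see the same thing from the DataFrame side, and where the shuffle files physically land, one more sketch (the `spark.local.dir` path below is our hypothetical choice, and the `shuffle_*.data` / `shuffle_*.index` file naming is a Spark internal that may change between versions):

```scala
import org.apache.spark.sql.SparkSession

object ShuffleFilesDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shuffle-files-demo")
      .master("local[*]")
      // Hypothetical path; spark.local.dir is where shuffle output is kept.
      .config("spark.local.dir", "/tmp/spark-shuffle-demo")
      .getOrCreate()
    import spark.implicits._

    val counts = spark.range(0, 1000000)
      .groupBy(($"id" % 10).as("bucket"))
      .count()

    counts.explain() // look for "Exchange hashpartitioning" = stage boundary
    counts.collect() // triggers the shuffle write, then the shuffle read

    // While the application is alive, /tmp/spark-shuffle-demo holds the
    // shuffle_*.data / shuffle_*.index files the parent stage wrote; on a
    // task failure, the child stage re-fetches them instead of recomputing
    // the parent stage.
    spark.stop()
  }
}
```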