Spark: disk I/O on stage boundaries explanation


Question

I can't find any information about Spark's temporary data persistence on disk in the official docs, only in some Spark optimization articles like this one:

At each stage boundary, data is written to disk by tasks in the parent stages and then fetched over the network by tasks in the child stage. Because they incur heavy disk and network I/O, stage boundaries can be expensive and should be avoided when possible.

Does the disk persistence at every stage boundary always apply, for both HashJoin and SortMergeJoin? Why does Spark (an in-memory engine) persist tmp files before a shuffle? Is that done for task-level recovery or for something else?

P.S. The question relates mainly to the Spark SQL API, though I'm also interested in Streaming & Structured Streaming.
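For reference, one way to see where the stage boundary (and hence the shuffle write) shows up for the join cases above is to inspect the physical plan. This is only a minimal sketch under my own assumptions (a local session, automatic broadcasting disabled via spark.sql.autoBroadcastJoinThreshold so the plain join is planned as a shuffle join; exact operator names vary by Spark version):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object JoinPlanDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("stage-boundary-demo")
      .master("local[*]")
      // Disable automatic broadcasting so the plain join becomes a shuffle join.
      .config("spark.sql.autoBroadcastJoinThreshold", "-1")
      .getOrCreate()
    import spark.implicits._

    val left  = (1 to 1000).map(i => (i, s"l$i")).toDF("id", "l")
    val right = (1 to 1000).map(i => (i, s"r$i")).toDF("id", "r")

    // Shuffle-based join: the plan shows Exchange (hashpartitioning) nodes,
    // i.e. stage boundaries where map-side output is materialized and then
    // fetched over the network by the tasks of the next stage.
    left.join(right, "id").explain()

    // Broadcast hint: the small side is collected and broadcast instead of
    // shuffled, so there is no shuffle stage boundary for it.
    left.join(broadcast(right), "id").explain()

    spark.stop()
  }
}
```

In the first plan I'd expect a SortMergeJoin with an Exchange on both sides; in the second, a BroadcastHashJoin where the broadcast side has a BroadcastExchange rather than a shuffle.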

UPD: found a mention and more details of why this happens in the book "Stream Processing with Apache Spark". Look for the "Task Failure Recovery" and "Stage Failure Recovery" topics on the referenced page. As far as I understand, Why = recovery, When = always, since this is the mechanics of Spark Core and the Shuffle Service, which is responsible for the data transfer. Moreover, all of Spark's APIs (SQL, Streaming & Structured Streaming) are based on the same failover guarantees (of Spark Core/RDD). So I suppose this is common behaviour for Spark in general.

Answer

It's a good question in that we hear about in-memory Spark vs. Hadoop, so it's a little confusing. The docs are terrible, but I ran a few things and verified the observations by looking around until I found a most excellent source: http://hydronitrogen.com/apache-spark-shuffles-explained-in-depth.html

Assuming an Action has been called - so as to avoid the obvious comment if this is not stated - and assuming we are not talking about a ResultStage and a broadcast join, then we are talking about a ShuffleMapStage. We look at an RDD initially.
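As an illustration (my own minimal sketch, not taken from the post), the stage split is already visible in the RDD lineage before any Action runs:

```scala
import org.apache.spark.sql.SparkSession

object RddStageDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-stage-demo")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val words  = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

    // The lineage contains a ShuffledRDD; the indented step in the printed
    // output marks the shuffle dependency, i.e. the boundary between the
    // ShuffleMapStage and the stage that consumes its output.
    println(counts.toDebugString)

    // An Action actually schedules and runs both stages.
    counts.collect().foreach(println)

    spark.stop()
  }
}
```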

Then, borrowing from that url:

  • A DAG dependency involving a shuffle means the creation of a separate Stage.
  • Map operations are followed by Reduce operations, then a Map, and so forth.

Current Stage

  • All the (fused) Map operations are performed intra-Stage.
  • The next Stage requirement, a Reduce operation - e.g. a reduceByKey - means the output is hashed or sorted by key (K) at the end of the Map operations of the current Stage.
  • This grouped data is written to disk on the Worker where the Executor is - or to storage tied to that Cloud version. (I would have thought in-memory was possible if the data is small, but this is an architectural Spark approach, as stated in the docs.) See the sketch after these lists.
  • The ShuffleManager is notified that hashed, mapped data is available for consumption by the next Stage. The ShuffleManager keeps track of all keys/locations once all of the map-side work is done.

Next Stage

  • The next Stage is a Reduce that then fetches the data from those locations by consulting the Shuffle Manager and using the Block Manager.
  • The Executor may be re-used, or be a new one on another Worker, or another Executor on the same Worker.
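To make the "written to disk on the Worker" point concrete, here is a rough sketch under my own assumptions (local mode, default sort-based shuffle; spark.local.dir is the documented scratch-space setting, but the exact directory and file layout under it is version-dependent):

```scala
import java.nio.file.{Files, Paths}
import org.apache.spark.sql.SparkSession

object ShuffleFilesDemo {
  def main(args: Array[String]): Unit = {
    // Point Spark's scratch space at a directory we can inspect.
    val scratch = Files.createTempDirectory("spark-scratch").toString

    val spark = SparkSession.builder()
      .appName("shuffle-files-demo")
      .master("local[*]")
      .config("spark.local.dir", scratch)
      .getOrCreate()
    val sc = spark.sparkContext

    // Force a shuffle and run it to completion with an Action.
    sc.parallelize(1 to 100000, 8)
      .map(i => (i % 10, 1L))
      .reduceByKey(_ + _)
      .count()

    // The scratch dir should now contain block-manager subdirectories holding
    // the map-side shuffle output files (exact names vary by Spark version).
    Files.walk(Paths.get(scratch)).forEach(p => println(p))

    spark.stop()
  }
}
```

After the count(), the listing should show the map-side shuffle output that reduce-side tasks would fetch; Spark normally cleans these directories up on shutdown.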

So, my understanding is that, architecturally, Stages mean writing to disk, even if there is enough memory. Given the finite resources of a Worker, it makes sense that writing to disk occurs for this type of operation. The more important point is, of course, the 'Map Reduce' implementation. I summarized the excellent posting; that is your canonical source.

Of course, fault tolerance is aided by this persistence: less re-computation work.

Similar aspects apply to DataFrames.
