Does Spark write intermediate shuffle outputs to disk


Problem Description

I'm reading Learning Spark, and I don't understand what it means that Spark's shuffle outputs are written to disk. See Chapter 8, Tuning and Debugging Spark, pages 148-149:

Spark's internal scheduler may truncate the lineage of the RDD graph if an existing RDD has already been persisted in cluster memory or on disk. A second case in which this truncation can happen is when an RDD is already materialized as a side effect of an earlier shuffle, even if it was not explicitly persisted. This is an under-the-hood optimization that takes advantage of the fact that Spark shuffle outputs are written to disk, and exploits the fact that many times portions of the RDD graph are recomputed.

As I understand it, there are different persistence policies; for example, the default MEMORY_ONLY means the intermediate result will never be persisted to disk.

When and why will a shuffle persist something on disk? How can that be reused by further computations?

Answer

When

It happens the first time an operation that requires a shuffle is evaluated (that is, when an action triggers it), and it cannot be disabled.

Why

This is an optimization. Shuffling is one of the expensive operations in Spark.

How

The shuffle output is automatically reused by any subsequent action executed on the same RDD.
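To make the mechanism concrete, here is a toy sketch in plain Python, not Spark's actual implementation: the "map" side of a shuffle hash-partitions records and spills each bucket to a file on disk; a later computation that needs the same shuffle finds those files already present and skips the expensive map side entirely. All names (`ToyShuffle`, `reduce_by_key`, `map_runs`) are invented for this illustration.

```python
import os
import pickle
import tempfile

class ToyShuffle:
    """Toy model of shuffle-output reuse (illustration only, not Spark code)."""

    def __init__(self, num_partitions=2):
        self.num_partitions = num_partitions
        self.out_dir = tempfile.mkdtemp(prefix="toy_shuffle_")
        self.map_runs = 0  # counts how often the expensive map side actually runs

    def _partition_file(self, pid):
        return os.path.join(self.out_dir, f"shuffle_part_{pid}.bin")

    def write_map_output(self, records):
        """Hash-partition (key, value) records and spill each bucket to disk."""
        if all(os.path.exists(self._partition_file(p))
               for p in range(self.num_partitions)):
            return  # shuffle files already on disk: reuse them, skip the map side
        self.map_runs += 1
        buckets = {p: [] for p in range(self.num_partitions)}
        for key, value in records:
            buckets[hash(key) % self.num_partitions].append((key, value))
        for pid, bucket in buckets.items():
            with open(self._partition_file(pid), "wb") as f:
                pickle.dump(bucket, f)

    def reduce_by_key(self, records, func):
        """Reduce side: read the on-disk shuffle files and merge values by key."""
        self.write_map_output(records)
        merged = {}
        for pid in range(self.num_partitions):
            with open(self._partition_file(pid), "rb") as f:
                for key, value in pickle.load(f):
                    merged[key] = func(merged[key], value) if key in merged else value
        return merged

shuffle = ToyShuffle()
data = [("a", 1), ("b", 2), ("a", 3)]
first = shuffle.reduce_by_key(data, lambda x, y: x + y)   # writes shuffle files
second = shuffle.reduce_by_key(data, lambda x, y: x + y)  # reuses them from disk
```

The second call returns the same result but `map_runs` stays at 1, mirroring how Spark can truncate lineage at a stage whose shuffle files already exist.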

