Spark: Difference between Shuffle Write, Shuffle spill (memory), Shuffle spill (disk)?

Question

I have the following Spark job, and I am trying to keep everything in memory:

import scala.collection.mutable.ListBuffer
import org.apache.spark.storage.StorageLevel

val myOutRDD = myInRDD.flatMap { fp =>
  val tuple2List: ListBuffer[(String, myClass)] = ListBuffer()
  // ... populate tuple2List from fp ...
  tuple2List
}.persist(StorageLevel.MEMORY_ONLY).reduceByKey { (p1, p2) =>
  myMergeFunction(p1, p2)
}.persist(StorageLevel.MEMORY_ONLY)


However, when I looked into the job tracker, I still saw a lot of Shuffle Write and Shuffle spill to disk ...

Total task time across all tasks: 49.1 h
Input Size / Records: 21.6 GB / 102123058
Shuffle write: 532.9 GB / 182440290
Shuffle spill (memory): 370.7 GB
Shuffle spill (disk): 15.4 GB


Then the job failed with "no space left on device" ... For the 532.9 GB of Shuffle write here, I am wondering: is it written to disk or to memory?

Also, why is 15.4 GB of data still spilled to disk when I specifically asked to keep it in memory?

Thanks!

Answer

The persist calls in your code are entirely wasted unless you access the RDD multiple times. What is the point of storing something you never access? Caching has no bearing on shuffle behavior, other than that you can avoid re-doing shuffles by keeping their output cached.
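
For example (a minimal sketch using the placeholder names from the question; expand is a hypothetical stand-in for the flatMap body), persist only pays off when the cached RDD is evaluated by more than one action:

import org.apache.spark.storage.StorageLevel

// Cache the reduced RDD once, then reuse it across several actions.
val reduced = myInRDD
  .flatMap(fp => expand(fp))                 // expand: placeholder for the flatMap logic above
  .reduceByKey((p1, p2) => myMergeFunction(p1, p2))
  .persist(StorageLevel.MEMORY_ONLY)

val total  = reduced.count()                 // first action: computes the lineage and fills the cache
val sample = reduced.take(10)                // second action: served from the cache, no recomputation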

Shuffle spill is controlled by the spark.shuffle.spill and spark.shuffle.memoryFraction configuration parameters. If spilling is enabled (it is by default), shuffle files spill to disk once they start using more memory than memoryFraction allows (20% by default).
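
As a rough sketch of where these knobs are set (the values are illustrative, not recommendations; these parameters belong to the legacy shuffle memory management of older Spark versions):

import org.apache.spark.{SparkConf, SparkContext}

// Keep spilling enabled (the default) and give the shuffle a larger
// fraction of executor memory than the default 20%.
val conf = new SparkConf()
  .setAppName("shuffle-tuning-sketch")
  .set("spark.shuffle.spill", "true")
  .set("spark.shuffle.memoryFraction", "0.4")

val sc = new SparkContext(conf)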

The metrics are very confusing. My reading of the code is that "Shuffle spill (memory)" is the amount of memory that was freed up as things were spilled to disk. The code for "Shuffle spill (disk)" looks like it is the amount actually written to disk. Judging from the code for "Shuffle write", I think it is the amount written to disk directly, not as a spill from a sorter.
