Spill to disk and shuffle write in Spark

Question

I'm getting confused about spill to disk and shuffle write. Using the default sort shuffle manager, we use an AppendOnlyMap for aggregating and combining partition records, right? Then, when execution memory fills up, we start sorting the map, spill it to disk, and clean up the map for the next spill (if one occurs). My questions are:

  • What is the difference between spill to disk and shuffle write? Both essentially consist of creating files on the local file system and writing records to them.

Assuming they are different: spill records are sorted because they pass through the map, whereas shuffle write records are not, because they don't pass through the map. Is that correct?

Thanks,

Giorgio

Answer

spill to disk and shuffle write are two different things.

spill to disk - data moves from host RAM to host disk - used when there is not enough RAM on your machine, so part of what would be held in RAM is placed on disk instead.

http://spark.apache.org/faq.html

Does my data need to fit in memory to use Spark?

No. Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data. Likewise, cached datasets that do not fit in memory are either spilled to disk or recomputed on the fly when needed, as determined by the RDD's storage level.
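As a concrete illustration of the cached-dataset case from the FAQ, here is a minimal sketch (the app name and dataset size are made up for illustration) that persists an RDD with a storage level allowing partitions that don't fit in memory to be spilled to the executor's local disk:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object SpillToDiskSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spill-to-disk-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // A dataset potentially larger than available executor memory.
    val big = sc.parallelize(1L to 100000000L)

    // MEMORY_AND_DISK: partitions that do not fit in memory are
    // written (spilled) to local disk rather than recomputed,
    // exactly as the FAQ answer above describes.
    big.persist(StorageLevel.MEMORY_AND_DISK)

    println(big.count())
    spark.stop()
  }
}
```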

shuffle write - data moves from executor(s) to other executor(s) - used when data needs to move between executors (e.g. due to a JOIN, groupBy, etc.).
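For example, a wide transformation like groupByKey forces a shuffle: each map-side task writes its output, partitioned by key, to local shuffle files, and the reduce-side tasks fetch those files over the network. A minimal sketch (the data is made up):

```scala
import org.apache.spark.sql.SparkSession

object ShuffleWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shuffle-write-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)))

    // groupByKey is a wide dependency: map-side tasks perform the
    // shuffle write, then reducers fetch those files to build groups.
    val grouped = pairs.groupByKey()
    grouped.collect().foreach(println)

    spark.stop()
  }
}
```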

More details can be found here:

  • https://0x0fff.com/spark-architecture-shuffle/
  • http://blog.cloudera.com/blog/2015/05/working-with-apache-spark-or-how-i-learned-to-stop-worrying-and-love-the-shuffle/

An edge-case example which might help clarify the issue:

  • You have 10 executors
  • Each executor has 100GB of RAM
  • The data size is 1280MB and it is partitioned into 10 partitions
  • Each executor holds 128MB of the data.

Assuming that the data holds one key, performing groupByKey will bring all the data into one partition. The shuffle size will be 9 * 128MB (9 executors will transfer their data to the last executor), and there won't be any spill to disk, as that executor has 100GB of RAM and only about 1.25GB of data.
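The skew can be reproduced in miniature; this hypothetical sketch (names and sizes invented) uses 10 input partitions that all share one key, so after groupByKey a single output partition holds everything and the other tasks ship their data to it:

```scala
import org.apache.spark.sql.SparkSession

object SingleKeySkewSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("single-key-skew-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // 10 input partitions, but every record shares the same key.
    val skewed = sc.parallelize(1 to 1000000, numSlices = 10)
      .map(v => ("the-only-key", v))

    // groupByKey hashes records by key, so everything lands in one
    // output partition: 9 of the 10 map tasks ship all of their data
    // to a single reducer (the 9 * 128MB shuffle described above).
    val grouped = skewed.groupByKey()

    // Count records per output partition: one partition holds the
    // single (key, values) group, the rest are empty.
    grouped.mapPartitionsWithIndex { (idx, it) =>
      Iterator((idx, it.size))
    }.collect().foreach(println)

    spark.stop()
  }
}
```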

Regarding AppendOnlyMap:

As written in the AppendOnlyMap code, this is a low-level implementation of a simple open hash table optimized for the append-only use case, where keys are never removed, but the value for each key may be changed.

The fact that two different modules use the same low-level function doesn't mean that those functions are related at a high level.
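To make "append-only open hash table" concrete, here is a teaching sketch of the idea; TinyAppendOnlyMap is a hypothetical name and this is not Spark's actual org.apache.spark.util.collection.AppendOnlyMap (which, among other things, grows its table as it fills):

```scala
// Illustrative only: keys are inserted once and never removed,
// but a key's value may be updated, e.g. to combine records.
class TinyAppendOnlyMap[K, V](capacity: Int = 64) {
  private val keys = new Array[Any](capacity)
  private val values = new Array[Any](capacity)

  // Open addressing with linear probing: walk forward from the
  // hash slot until we find the key or an empty slot.
  // (No resizing in this sketch; the real class grows the table.)
  private def slot(k: K): Int = {
    var i = math.abs(k.hashCode % capacity)
    while (keys(i) != null && keys(i) != k) i = (i + 1) % capacity
    i
  }

  // Insert the key, or combine the new value with the existing one.
  def changeValue(k: K, v: V, combine: (V, V) => V): Unit = {
    val i = slot(k)
    if (keys(i) == null) { keys(i) = k; values(i) = v }
    else values(i) = combine(values(i).asInstanceOf[V], v)
  }

  def apply(k: K): Option[V] = {
    val i = slot(k)
    if (keys(i) == null) None else Some(values(i).asInstanceOf[V])
  }
}

object TinyAppendOnlyMapDemo {
  def main(args: Array[String]): Unit = {
    val m = new TinyAppendOnlyMap[String, Int]()
    m.changeValue("a", 1, _ + _) // insert
    m.changeValue("a", 2, _ + _) // combine: 1 + 2
    println(m("a"))              // Some(3)
  }
}
```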
