Why does a job fail with "No space left on device", but df says otherwise?
Problem Description
When performing a shuffle, my Spark job fails with "no space left on device", but when I run df -h it says I have free space left! Why does this happen, and how can I fix it?
Recommended Answer
You also need to monitor df -i, which shows how many inodes are in use.
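As a quick check (a minimal sketch; /tmp stands in for whatever directory spark.local.dir points at on your workers):

```shell
# Inspect inode usage for the filesystem backing Spark's scratch space
# (spark.local.dir defaults to /tmp). An IUse% near 100% means inode
# exhaustion, even while `df -h` still reports free blocks.
df -i /tmp
```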
On each machine, we create M * R temporary files for shuffle, where M = number of map tasks and R = number of reduce tasks.
https://spark-project.atlassian.net/browse/SPARK-751
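To see why this bites, multiply the two task counts; the figures below are hypothetical:

```shell
# Hypothetical job: 1000 map tasks and 1000 reduce tasks.
M=1000
R=1000
# Each machine can end up hosting on the order of M * R shuffle files,
# and every file consumes an inode regardless of its size.
echo $(( M * R ))   # prints 1000000
```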
If you do indeed see that disks are running out of inodes, you can fix the problem in a few ways:
- Decrease partitions (see coalesce with shuffle = false).
- Drop the number of files to O(R) by "consolidating files". As different file systems behave differently, it is recommended that you read up on spark.shuffle.consolidateFiles and see https://spark-project.atlassian.net/secure/attachment/10600/Consolidating%20Shuffle%20Files%20in%20Spark.pdf.
- Sometimes you may simply find that you need your DevOps to increase the number of inodes the FS supports.
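For the consolidation route, the setting can be passed at submit time; this is a sketch, and the application jar name here is a placeholder:

```shell
# Enable shuffle-file consolidation to reduce the per-machine shuffle
# file count toward O(R). Note: this option only exists on Spark < 1.6
# (see the edit below).
spark-submit \
  --conf spark.shuffle.consolidateFiles=true \
  my-app.jar
```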
Edit

Consolidating files has been removed from Spark since version 1.6: https://issues.apache.org/jira/browse/SPARK-9808