Why does a job fail with "No space left on device", but df says otherwise?
Question

When performing a shuffle my Spark job fails with "No space left on device", but when I run df -h it says I have free space left! Why does this happen, and how can I fix it?
Answer
You also need to monitor df -i, which shows how many inodes are in use. A filesystem can run out of inodes long before it runs out of blocks, and df -h only reports the latter.
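On POSIX systems you can read the same numbers df -i reports programmatically via os.statvfs. A minimal sketch (the path "/" is an example; point it at the disk that hosts your Spark local directories):

```python
import os

# statvfs exposes inode counts for the filesystem containing the given path.
st = os.statvfs("/")

total_inodes = st.f_files    # total inodes on the filesystem
avail_inodes = st.f_favail   # inodes available to unprivileged processes

print(f"inodes used: {total_inodes - avail_inodes} of {total_inodes}")
```

If the used count approaches the total while df -h still shows free space, shuffle files exhausting inodes is the likely culprit.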
During a shuffle, each machine creates M * R temporary files, where M = number of map tasks and R = number of reduce tasks (see https://spark-project.atlassian.net/browse/SPARK-751).
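The arithmetic makes it clear how quickly this explodes. A sketch with hypothetical task counts:

```python
def shuffle_file_count(map_tasks: int, reduce_tasks: int) -> int:
    """Per-machine temporary shuffle files without consolidation: M * R."""
    return map_tasks * reduce_tasks

# Hypothetical job: 10,000 map tasks shuffling into 200 reducers.
before = shuffle_file_count(10_000, 200)  # 2,000,000 files
# Narrowing the map side to 1,000 partitions (e.g. via coalesce) cuts M,
# and therefore the file count, by 10x.
after = shuffle_file_count(1_000, 200)    # 200,000 files
print(before, after)
```

Even a modestly sized job can create millions of tiny files, each consuming an inode.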
If you do indeed see that disks are running out of inodes, to fix the problem you can:

- Decrease partitions (see coalesce with shuffle = false).
- Drop the number of files to O(R) by "consolidating files". As different filesystems behave differently, it is recommended that you read up on spark.shuffle.consolidateFiles and see https://spark-project.atlassian.net/secure/attachment/10600/Consolidating%20Shuffle%20Files%20in%20Spark.pdf.
- Sometimes you may simply find that you need your DevOps team to increase the number of inodes the filesystem supports.
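The consolidation option is a one-line configuration change. A sketch of how it might look in spark-defaults.conf (applies only to Spark versions before 1.6, where the setting still exists; it is disabled by default):

```
# Merge shuffle outputs so each machine writes O(R) files instead of M * R.
spark.shuffle.consolidateFiles  true
```

The same property can also be set on a SparkConf at application startup. Test carefully on your filesystem, since consolidation behaved differently (and sometimes worse) on some filesystems, notably ext3.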
Edit
Consolidating files has been removed from Spark since version 1.6: https://issues.apache.org/jira/browse/SPARK-9808