Spark on Google's Dataproc failed due to java.io.FileNotFoundException: /hadoop/yarn/nm-local-dir/usercache/root/appcache/


Problem description

I've been using Spark/Hadoop on Dataproc for months, both via Zeppelin and the Dataproc console, but just recently I got the following error.

Caused by: java.io.FileNotFoundException: /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1530998908050_0001/blockmgr-9d6a2308-0d52-40f5-8ef3-0abce2083a9c/21/temp_shuffle_3f65e1ca-ba48-4cb0-a2ae-7a81dcdcf466 (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:103)
at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116)
at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:237)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

First, I got this type of error in a Zeppelin notebook and thought it was a Zeppelin issue. The error, however, seems to occur randomly. I suspect it has something to do with one of the Spark workers not being able to write to that path. So I googled and was advised to delete the files under /hadoop/yarn/nm-local-dir/usercache/ on each Spark worker and to check that there is disk space available on each worker. After doing so, I still sometimes got this error. I also ran a Spark job directly on Dataproc, and a similar error occurred. I'm on Dataproc image version 1.2.
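For concreteness, the disk-space check mentioned above can be scripted roughly as follows (a minimal Python 3 sketch to run on each worker node; the path and the 1 GiB threshold are illustrative assumptions, not values taken from the cluster):

# Minimal sketch: report free space under the YARN NodeManager local dir on a worker.
# Requires Python 3.3+ for shutil.disk_usage; path and threshold are illustrative.
import shutil

NM_LOCAL_DIR = "/hadoop/yarn/nm-local-dir"
MIN_FREE_BYTES = 1 << 30  # 1 GiB, arbitrary warning threshold

usage = shutil.disk_usage(NM_LOCAL_DIR)
print("total=%d used=%d free=%d" % (usage.total, usage.used, usage.free))
if usage.free < MIN_FREE_BYTES:
    print("Warning: low disk space under %s" % NM_LOCAL_DIR)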

Thanks,

Peeranat F.

Recommended answer

OK. We faced the same issue on GCP, and the reason for it is resource preemption.

In GCP, resource preemption can be done by one of the following two strategies:

  1. Node preemption - removing nodes in the cluster and replacing them
  2. Container preemption - removing YARN containers.

This setting is made in GCP by your admin/DevOps person to optimize the cost and resource utilization of the cluster, especially if it is being shared.

What your stack trace tells me is that it's node preemption. This error occurs randomly because sometimes the node that gets preempted is your driver node, which causes the app to fail altogether.

You can see which nodes are preemptable in your GCP console.
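If the cluster does use preemptible workers, a generic mitigation is to raise Spark's retry tolerance so that a lost container does not immediately fail the job. The sketch below is only an illustration and goes beyond what the answer itself prescribes: it assumes Spark on YARN, the property names are standard Spark/YARN settings, the values are arbitrary, and none of this helps when the driver node itself is preempted.

# Sketch: generic retry settings that make a job more tolerant of lost
# containers (e.g. preempted YARN nodes). Values are illustrative only.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("preemption-tolerant-job")
    # Retry an individual task more times before failing its stage (default is 4).
    .config("spark.task.maxFailures", "8")
    # Allow YARN to re-attempt the application if the application master is lost.
    .config("spark.yarn.maxAppAttempts", "4")
    .getOrCreate()
)

The same properties can also be passed to spark-submit with --conf instead of being set in the session builder.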
