Spark job crashes intermittently with File Not Found during shuffle

Problem description

I have several Spark jobs, both batch and streaming, that process and analyze system logs. We use Kafka as the pipeline connecting the jobs.
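For context, here is a minimal sketch of the kind of Spark Streaming + Kafka 0-10 job described above, since the actual code was not posted; the broker address, topic name, group id, and batch interval are all hypothetical placeholders:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object LogStreamJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("log-stream-analyzer")
    val ssc  = new StreamingContext(conf, Seconds(10)) // hypothetical batch interval

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "kafka:9092",              // hypothetical broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "log-analyzer",                     // hypothetical group id
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("syslog"), kafkaParams))

    // Count occurrences of each log line per batch. reduceByKey forces a
    // shuffle, which is where the intermediate shuffle files named in the
    // exception below get written on the executors' local disks.
    stream.map(record => (record.value, 1L))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```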

After upgrading to Spark 2.1.0 + Spark Kafka Streaming 010, I found that some of the jobs (both batch and streaming) randomly throw the exception below, either after running for several hours or within just 20 minutes. Can anyone suggest how to find the real root cause? (There seem to be many posts discussing this, but the solutions do not look useful to me...)

Is this due to a Spark configuration issue or a code bug? I cannot paste all my job code, as there is too much of it.

```
00:30:04,510 WARN - 17/07/22 00:30:04 WARN TaskSetManager: Lost task 60.0 in stage 1518490.0 (TID 338070, 10.133.96.21, executor 0): java.io.FileNotFoundException: /mnt/mesos/work_dir/slaves/20160924-021501-274760970-5050-7646-S2/frameworks/40aeb8e5-e82a-4df9-b034-8815a7a7564b-2543/executors/0/runs/fd15c15d-2511-4f37-a106-27431f583153/blockmgr-a0e0e673-f88b-4d12-a802-c35643e6c6b2/33/shuffle_2090_60_0.index.b66235be-79be-4455-9759-1c7ba70f91f6 (No such file or directory)
	at java.io.FileOutputStream.open0(Native Method)
	at java.io.FileOutputStream.open(FileOutputStream.java:270)
	at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
	at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
	at org.apache.spark.shuffle.IndexShuffleBlockResolver.writeIndexFileAndCommit(IndexShuffleBlockResolver.scala:144)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:128)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
```

Answer

I finally found the root cause. There is no problem with the Spark jobs at all. We have a crontab entry that wrongly cleaned up the temporary storage under /mnt and deleted Spark's shuffle and cache files there.
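The accepted fix was simply to correct the cron job so it no longer deletes Spark's working files. As a complementary safeguard (not part of the original answer), you can point Spark's scratch space at a path your cleanup jobs never touch via the spark.local.dir property. The path below is a hypothetical example; note that cluster managers such as Mesos or YARN may override this setting with their own work directories, as the /mnt/mesos/work_dir path in the stack trace suggests:

```scala
import org.apache.spark.SparkConf

// Hypothetical safeguard: keep shuffle, spill, and cache files on a
// dedicated path that no cron cleanup visits. "/data/spark-local" is an
// illustrative, assumed path -- not from the original answer.
val conf = new SparkConf()
  .setAppName("log-batch-job")
  .set("spark.local.dir", "/data/spark-local")
```

Whichever path ends up holding the blockmgr-* directories, the safer design is to let Spark and the cluster manager clean up their own work directories rather than removing them with an external cron job.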
