Spark never finishes jobs and stages, JobProgressListener crash

Problem description

We have a Spark application that continuously processes a lot of incoming jobs. Several jobs are processed in parallel, on multiple threads.
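A minimal sketch of that kind of setup, for context. The input paths, thread-pool size, and local master are illustrative assumptions, not the actual application:

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration
import org.apache.spark.sql.SparkSession

object ParallelJobsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parallel-jobs-sketch")
      .master("local[*]")            // assumption: the real app runs on a small cluster
      .getOrCreate()

    // Submit several independent Spark jobs concurrently from a thread pool,
    // mirroring "several jobs in parallel, on multiple threads".
    implicit val ec: ExecutionContext =
      ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))

    val paths = Seq("input/a", "input/b", "input/c")   // hypothetical input paths
    val jobs = paths.map { p =>
      Future {
        // Each count() triggers a separate Spark job; many small inputs mean
        // many short-lived jobs and stages, which is the pattern described above.
        spark.read.textFile(p).count()
      }
    }

    jobs.foreach(f => Await.result(f, Duration.Inf))
    spark.stop()
  }
}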

During intensive workloads, at some point, we start to see this kind of warning:

16/12/14 21:04:03 WARN JobProgressListener: Task end for unknown stage 147379
16/12/14 21:04:03 WARN JobProgressListener: Job completed for unknown job 64610
16/12/14 21:04:04 WARN JobProgressListener: Task start for unknown stage 147405
16/12/14 21:04:04 WARN JobProgressListener: Task end for unknown stage 147406
16/12/14 21:04:04 WARN JobProgressListener: Job completed for unknown job 64622

Starting from that point, the performance of the app plummets, and most stages and jobs never finish. On the Spark UI, I can see figures like 13,000 pending/active jobs.

I can't clearly see another, more informative exception happening beforehand. Maybe this one, but it concerns another listener:

16/12/14 21:03:54 ERROR LiveListenerBus: Dropping SparkListenerEvent because no remaining room in event queue. This likely means one of the SparkListeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler.
16/12/14 21:03:54 WARN LiveListenerBus: Dropped 1 SparkListenerEvents since Thu Jan 01 01:00:00 CET 1970

This is a very annoying problem, because there is no clear crash or clear ERROR message we could catch in order to relaunch the app.

Update:

What bugs me most is that I would expect this to happen on large configurations (a large cluster would DDoS the driver with task results more easily), but that's not the case. Our cluster is fairly small; its only particularity is that we tend to process a mix of small and large files, and the small files generate many tasks that finish quickly.

Recommended answer

I may have found a workaround:

Changing the value of spark.scheduler.listenerbus.eventqueue.size (100000 instead of the default 10000) seems to help, but it may only postpone the problem.
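For reference, a minimal sketch of applying that setting programmatically; the app name is just a placeholder, and the property can equally be passed to spark-submit with --conf spark.scheduler.listenerbus.eventqueue.size=100000:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object ListenerQueueWorkaround {
  def main(args: Array[String]): Unit = {
    // The setting is read when the SparkContext is created, so it must be
    // in place before building the session.
    val conf = new SparkConf()
      .setAppName("listener-queue-workaround")
      // Default is 10000 queued events; 100000 gives the listeners more
      // headroom, though under sustained load it may only delay the drops.
      .set("spark.scheduler.listenerbus.eventqueue.size", "100000")

    val spark = SparkSession.builder().config(conf).getOrCreate()
    // ... run the application's jobs here ...
    spark.stop()
  }
}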
