火花驱动程序被主人分离并移除 [英] Spark driver disassociated and removed by the master

查看:105
本文介绍了火花驱动程序被主人分离并移除的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个由两个奴隶和一个主人组成的集群,并且我将一个jar(scala)提交给spark master(192.168.1.64):

I have a cluster made by two slaves and one master and set up and I submit a jar (scala) to the spark master (192.168.1.64):

spark-submit --master spark://spark-master:7077 --class tests.elements target/scala-2.10/zzz-project_2.10-1.0.jar

经过相当长时间的运行后,它终于突然停下来,终端上的最后一行是

After quite sometime running just fine it stops abruptly with the last lines on the terminal being

...
15/08/19 17:45:24 INFO scheduler.TaskSchedulerImpl: Adding task set 411292.0 with 6 tasks
15/08/19 17:45:24 WARN scheduler.TaskSetManager: Stage 411292 contains a task of very large size (2762 KB). The maximum recommended task size is 100 KB.
15/08/19 17:45:24 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 411292.0 (TID 1832, 192.168.1.64, PROCESS_LOCAL, 2828792 bytes)
15/08/19 17:45:24 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 411292.0 (TID 1833, 192.168.1.62, PROCESS_LOCAL, 2310009 bytes)
15/08/19 17:45:24 INFO scheduler.TaskSetManager: Starting task 3.0 in stage 411292.0 (TID 1834, 192.168.1.64, PROCESS_LOCAL, 2669188 bytes)
15/08/19 17:45:24 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 411292.0 (TID 1835, 192.168.1.62, PROCESS_LOCAL, 2295676 bytes)
15/08/19 17:45:24 INFO scheduler.TaskSetManager: Starting task 4.0 in stage 411292.0 (TID 1836, 192.168.1.64, PROCESS_LOCAL, 2847786 bytes)
15/08/19 17:45:24 INFO scheduler.TaskSetManager: Starting task 5.0 in stage 411292.0 (TID 1837, 192.168.1.64, PROCESS_LOCAL, 2913528 bytes)
Killed

,并且在主日志中发生的错误如下:

and the error occurring at the master log is the following:

...
15/08/19 16:09:49 INFO master.Master: Launching executor app-20150819160949-0001/0 on worker worker-20150819160925-192.168.1.64-51640
15/08/19 16:09:49 INFO master.Master: Launching executor app-20150819160949-0001/1 on worker worker-20150819160938-192.168.1.62-38007
15/08/19 16:15:44 INFO master.Master: akka.tcp://sparkDriver@192.168.1.64:46823 got disassociated, removing it.
15/08/19 16:15:44 INFO master.Master: Removing app app-20150819160949-0001
15/08/19 16:15:44 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@192.168.1.64:46823] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/08/19 16:15:44 WARN master.Master: Application testPageRank is still in progress, it may be terminated abnormally.
...

这两名工作人员在他们的日志中都有类似这样的内容

Both workers have in their logs something like this

...
15/08/19 16:15:49 INFO worker.Worker: Executor app-20150819160949-0001/0 finished with state EXITED message Command exited with code 1 exitStatus 1
15/08/19 16:15:50 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@192.168.1.64:54799] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].

...
15/08/19 16:15:43 INFO worker.Worker: Executor app-20150819160949-0001/1 finished with state EXITED message Command exited with code 1 exitStatus 1
15/08/19 16:15:43 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@192.168.1.62:53325] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].

。工作/应用程序文件包含这样的内容

respectively. The work/app files contain something like this

...
15/08/19 16:15:41 INFO executor.Executor: Finished task 1.0 in stage 387758.0 (TID 1803). 1911 bytes result sent to driver
15/08/19 16:15:41 INFO executor.Executor: Finished task 4.0 in stage 387758.0 (TID 1806). 1911 bytes result sent to driver
15/08/19 16:15:41 INFO storage.BlockManager: Found block rdd_1206_5 locally
15/08/19 16:15:41 INFO executor.Executor: Finished task 5.0 in stage 387758.0 (TID 1807). 1911 bytes result sent to driver
15/08/19 16:15:41 INFO storage.BlockManager: Found block rdd_1206_3 locally
15/08/19 16:15:41 INFO executor.Executor: Finished task 3.0 in stage 387758.0 (TID 1805). 1911 bytes result sent to driver
15/08/19 16:15:44 ERROR executor.CoarseGrainedExecutorBackend: Driver 192.168.1.64:46823 disassociated! Shutting down.
15/08/19 16:15:44 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@192.168.1.64:46823] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/08/19 16:15:45 INFO storage.DiskBlockManager: Shutdown hook called
15/08/19 16:15:46 INFO util.Utils: Shutdown hook called

...
15/08/19 16:15:41 INFO storage.BlockManager: Found block rdd_1206_0 locally
15/08/19 16:15:41 INFO executor.Executor: Finished task 2.0 in stage 387758.0 (TID 1804). 1911 bytes result sent to driver
15/08/19 16:15:41 INFO executor.Executor: Finished task 0.0 in stage 387758.0 (TID 1802). 1911 bytes result sent to driver
15/08/19 16:15:42 ERROR executor.CoarseGrainedExecutorBackend: Driver 192.168.1.64:46823 disassociated! Shutting down.
15/08/19 16:15:42 INFO storage.DiskBlockManager: Shutdown hook called
15/08/19 16:15:42 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@192.168.1.64:46823] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/08/19 16:15:42 INFO util.Utils: Shutdown hook called

分别。在hdfs或spark中似乎没有其他错误。

respectively. There seem to be no other error in hdfs or spark.

我怀疑错误在于主日志,第三行( 15 / 08/19 16:15:44 INFO master.Master:akka.tcp://sparkDriver@192.168.1.64:46823解除关联,删除它。)但我找不到原因。我尝试将 spark.akka.heartbeat.interval 更改为100,如某些帖子中的建议,但没有运气。任何人都会知道为什么会发生以及如何解决这个问题?非常感谢。

I am suspecting that the error lies in the master log, the third line (15/08/19 16:15:44 INFO master.Master: akka.tcp://sparkDriver@192.168.1.64:46823 got disassociated, removing it.) but I can't figure out why. I tried changing the spark.akka.heartbeat.interval to 100 as suggested in some posts but no luck. Anyone would know why it happens and how to solve this? Thanks so much.

推荐答案

正如这里提到的一个非常类似的问题警告ReliableDeliverySupervisor:与远程系统的关联失败,地址现在门控为[5000] ms。原因:[解除关联]

As mentioned in a very similar question here WARN ReliableDeliverySupervisor: Association with remote system has failed, address is now gated for [5000] ms. Reason: [Disassociated]

问题可能是内存不足。增加更多内存(或者在这种情况下更多节点)应该可以解决问题。

The problem is likely to be the lack of memory. Adding more memory (or in that case more nodes) should solve the problem.

(或者,需要更少的内存当然也应该工作)。

(Alternately, needing less memory should work too of course).

这篇关于火花驱动程序被主人分离并移除的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆