Spark cluster full of heartbeat timeouts, executors exiting on their own


Problem Description


My Apache Spark cluster is running an application that is giving me lots of executor timeouts:

10:23:30,761 ERROR ~ Lost executor 5 on slave2.cluster: Executor heartbeat timed out after 177005 ms
10:23:30,806 ERROR ~ Lost executor 1 on slave4.cluster: Executor heartbeat timed out after 176991 ms
10:23:30,812 ERROR ~ Lost executor 4 on slave6.cluster: Executor heartbeat timed out after 176981 ms
10:23:30,816 ERROR ~ Lost executor 6 on slave3.cluster: Executor heartbeat timed out after 176984 ms
10:23:30,820 ERROR ~ Lost executor 0 on slave5.cluster: Executor heartbeat timed out after 177004 ms
10:23:30,835 ERROR ~ Lost executor 3 on slave7.cluster: Executor heartbeat timed out after 176982 ms


However, in my configuration I can confirm I successfully increased the executor heartbeat interval:
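The configuration snippet referenced above did not survive in this copy of the post. A typical spark-defaults.conf entry for that setting would look like the following (the value shown is illustrative, not the asker's actual configuration):

```
# spark-defaults.conf -- raise the executor heartbeat interval
# (the default is 10s; the value below is illustrative)
spark.executor.heartbeatInterval  60s
```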


When I visit the logs of executors marked as EXITED (i.e.: the driver removed them when it couldn't get a heartbeat), it appears that executors killed themselves because they didn't receive any tasks from the driver:

16/05/16 10:11:26 ERROR TransportChannelHandler: Connection to /10.0.0.4:35328 has been quiet for 120000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.network.timeout if this is wrong.
16/05/16 10:11:26 ERROR CoarseGrainedExecutorBackend: Cannot register with driver: spark://CoarseGrainedScheduler@10.0.0.4:35328


How can I turn off heartbeats and/or prevent the executors from timing out?

Recommended Answer


The answer was rather simple: in my spark-defaults.conf I set spark.network.timeout to a higher value. The heartbeat interval turned out to be largely irrelevant to the problem (though tuning it is still handy).
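Concretely, the fix amounts to a single line in spark-defaults.conf. The value below is illustrative; Spark accepts time strings with unit suffixes (e.g. 600s) for this property:

```
# spark-defaults.conf -- raise the network timeout from its 120s default
# so that slow executors are not declared dead (value is illustrative)
spark.network.timeout  600s
```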


When using spark-submit I was also able to set the timeout as follows:

$SPARK_HOME/bin/spark-submit --conf spark.network.timeout=10000000 \
    --class myclass.neuralnet.TrainNetSpark \
    --master spark://master.cluster:7077 \
    --driver-memory 30G --executor-memory 14G \
    --num-executors 7 --executor-cores 8 \
    --conf spark.driver.maxResultSize=4g \
    --conf spark.executor.heartbeatInterval=10000000 \
    path/to/my.jar
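One caveat worth adding (my observation, not part of the original answer): Spark requires spark.executor.heartbeatInterval to be smaller than spark.network.timeout, so when raising both, it pays to check the values against each other. A quick sanity check in Python, using illustrative values and a deliberately simplified parser for Spark-style time strings:

```python
def parse_seconds(value: str) -> int:
    """Parse a Spark-style time string like '600s' or '10m' into seconds.

    Simplified for illustration: real Spark also accepts 'ms', 'us', 'd',
    and interprets bare numbers differently per property.
    """
    units = {"s": 1, "m": 60, "h": 3600}
    if value[-1] in units:
        return int(value[:-1]) * units[value[-1]]
    return int(value)  # bare numbers treated as seconds here

# Illustrative values, not the ones from the command above.
network_timeout = parse_seconds("600s")
heartbeat_interval = parse_seconds("60s")

# Spark rejects configurations where this does not hold.
assert heartbeat_interval < network_timeout, (
    "spark.executor.heartbeatInterval must be smaller than "
    "spark.network.timeout"
)
```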

