Spark cluster full of heartbeat timeouts, executors exiting on their own
Question
My Apache Spark cluster is running an application that is giving me lots of executor timeouts:
10:23:30,761 ERROR ~ Lost executor 5 on slave2.cluster: Executor heartbeat timed out after 177005 ms
10:23:30,806 ERROR ~ Lost executor 1 on slave4.cluster: Executor heartbeat timed out after 176991 ms
10:23:30,812 ERROR ~ Lost executor 4 on slave6.cluster: Executor heartbeat timed out after 176981 ms
10:23:30,816 ERROR ~ Lost executor 6 on slave3.cluster: Executor heartbeat timed out after 176984 ms
10:23:30,820 ERROR ~ Lost executor 0 on slave5.cluster: Executor heartbeat timed out after 177004 ms
10:23:30,835 ERROR ~ Lost executor 3 on slave7.cluster: Executor heartbeat timed out after 176982 ms
However, in my configuration I can confirm I successfully increased the executor heartbeat interval:
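The original post's configuration snippet was not preserved here, but the setting referred to is `spark.executor.heartbeatInterval`. A minimal sketch of what such a `spark-defaults.conf` entry looks like (the value shown is illustrative, not the poster's actual value):

```properties
# Illustrative: raise how often executors send heartbeats to the driver.
# The actual value used by the original poster is not shown in this post.
spark.executor.heartbeatInterval  10000000
```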
When I visit the logs of executors marked as EXITED (i.e.: the driver removed them when it couldn't get a heartbeat), it appears that the executors killed themselves because they didn't receive any tasks from the driver:
16/05/16 10:11:26 ERROR TransportChannelHandler: Connection to /10.0.0.4:35328 has been quiet for 120000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.network.timeout if this is wrong.
16/05/16 10:11:26 ERROR CoarseGrainedExecutorBackend: Cannot register with driver: spark://CoarseGrainedScheduler@10.0.0.4:35328
How can I turn off heartbeats and/or prevent the executors from timing out?
Answer
The answer was rather simple. In my spark-defaults.conf I set spark.network.timeout to a higher value. The heartbeat interval was somewhat irrelevant to the problem (though tuning it is handy).
When using spark-submit I was also able to set the timeout as follows:
$SPARK_HOME/bin/spark-submit \
    --conf spark.network.timeout=10000000 \
    --class myclass.neuralnet.TrainNetSpark \
    --master spark://master.cluster:7077 \
    --driver-memory 30G \
    --executor-memory 14G \
    --num-executors 7 \
    --executor-cores 8 \
    --conf spark.driver.maxResultSize=4g \
    --conf spark.executor.heartbeatInterval=10000000 \
    path/to/my.jar