How to limit the number of retries on Spark job failure?
Question
We are running a Spark job via spark-submit, and I can see that the job will be re-submitted in the case of failure.

How can I stop it from having attempt #2 in case of YARN container failure or whatever the exception may be?

This happened due to lack of memory and a "GC overhead limit exceeded" issue.
Solution

There are two settings that control the number of retries (i.e. the maximum number of ApplicationMaster registration attempts with YARN before the attempt, and hence the entire Spark application, is considered failed):

spark.yarn.maxAppAttempts - Spark's own setting. See MAX_APP_ATTEMPTS:

private[spark] val MAX_APP_ATTEMPTS = ConfigBuilder("spark.yarn.maxAppAttempts")
  .doc("Maximum number of AM attempts before failing the app.")
  .intConf
  .createOptional
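For the situation in the question, passing this setting on the spark-submit command line is enough to forbid a second attempt; a minimal sketch, where the application class and jar are placeholder names, not from the original question:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.maxAppAttempts=1 \
  --class com.example.MyApp \
  my-app.jar

With spark.yarn.maxAppAttempts=1, an ApplicationMaster that dies (e.g. from an OutOfMemoryError or "GC overhead limit exceeded") is not re-attempted.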
yarn.resourcemanager.am.max-attempts
- YARN's own setting, with the default being 2.
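The YARN side is a cluster-wide setting configured in yarn-site.xml; a sketch showing the default value of 2:

<property>
  <name>yarn.resourcemanager.am.max-attempts</name>
  <value>2</value>
</property>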
As you can see in YarnRMClient.getMaxRegAttempts, the actual number is the minimum of the YARN and Spark configuration settings, with YARN's being the last resort.
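A minimal Scala sketch of that resolution logic as described above (not the verbatim Spark source; the method and parameter names are illustrative):

// Spark's setting can only lower YARN's cap; YARN's value is the fallback
// when spark.yarn.maxAppAttempts is not set.
def effectiveMaxAttempts(sparkMaxAttempts: Option[Int], yarnMaxAttempts: Int): Int =
  sparkMaxAttempts match {
    case Some(n) => math.min(n, yarnMaxAttempts) // Spark can only reduce the limit
    case None    => yarnMaxAttempts              // YARN's setting is the last resort
  }

So setting spark.yarn.maxAppAttempts higher than yarn.resourcemanager.am.max-attempts has no effect, while setting it to 1 guarantees a single attempt regardless of the YARN value.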