Google Dataproc - disconnect with executors often

Question

I am using Dataproc to run Spark commands over a cluster using spark-shell. I frequently get error/warning messages indicating that I lose connection with my executors. The messages look like this:

[Stage 6:>                                                          (0 + 2) / 2]16/01/20 10:10:24 ERROR org.apache.spark.scheduler.cluster.YarnScheduler: Lost executor 5 on spark-cluster-femibyte-w-0.c.gcebook-1039.internal: remote Rpc client disassociated
16/01/20 10:10:24 WARN akka.remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@spark-cluster-femibyte-w-0.c.gcebook-1039.internal:60599] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
16/01/20 10:10:24 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.2 in stage 6.0 (TID 17, spark-cluster-femibyte-w-0.c.gcebook-1039.internal): ExecutorLostFailure (executor 5 lost)
16/01/20 10:10:24 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.2 in stage 6.0 (TID 16, spark-cluster-femibyte-w-0.c.gcebook-1039.internal): ExecutorLostFailure (executor 5 lost)

...

Here's another example:

20 10:51:43 ERROR org.apache.spark.scheduler.cluster.YarnScheduler: Lost executor 2 on spark-cluster-femibyte-w-1.c.gcebook-1039.internal: remote Rpc client disassociated
16/01/20 10:51:43 WARN akka.remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@spark-cluster-femibyte-w-1.c.gcebook-1039.internal:58745] has failed, address is now gated for [5000] ms. Reason: [Disassociated] 
16/01/20 10:51:43 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in stage 4.0 (TID 5, spark-cluster-femibyte-w-1.c.gcebook-1039.internal): ExecutorLostFailure (executor 2 lost)
16/01/20 10:51:43 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in stage 4.0 (TID 4, spark-cluster-femibyte-w-1.c.gcebook-1039.internal): ExecutorLostFailure (executor 2 lost)
16/01/20 10:51:43 WARN org.apache.spark.ExecutorAllocationManager:  Attempted to mark unknown executor 2 idle

Is this normal? Is there anything I can do to prevent this?

Recommended answer

If the job itself isn't failing, and you're not seeing other propagated errors associated with actual task failures (at least as far as I can tell from what's posted in the question), then most likely you're just seeing a harmless but known-to-be-spammy issue in core Spark. The key here is that Spark dynamic allocation relinquishes underused executors during a job and re-allocates them as needed. Spark originally failed to suppress the executor-lost messages this produces, but we've tested to make sure it has no ill effects on the actual job.
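As a rough way to apply the distinction above, the following shell sketch counts executor-lost messages and job-level aborts in a saved copy of the driver output. The function name `summarize_spark_log` and the idea of saving the output to a file are illustrative assumptions, not part of Dataproc or Spark. The string `Job aborted due to stage failure` is assumed to be how the driver reports a job that actually failed; adjust it if your Spark version words this differently.

```shell
#!/bin/sh
# Hypothetical helper: summarize a saved Spark driver log.
# $1 is the path to a file containing the driver's console output.
summarize_spark_log() {
    log_file="$1"
    # Executor-lost messages like the ones shown in the question.
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    lost=$(grep -c 'Lost executor' "$log_file" || true)
    # Job-level failures (assumed wording of the driver's abort message).
    aborted=$(grep -c 'Job aborted due to stage failure' "$log_file" || true)
    echo "lost_executors=$lost job_aborts=$aborted"
}
```

If this reports lost executors but no job aborts, that is consistent with the harmless dynamic-allocation noise described above, though it is not proof on its own.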

Here's a googlegroups thread highlighting some of the behavioral details of Spark on YARN.

To check whether it's indeed dynamic allocation causing the messages, try running:

spark-shell --conf spark.dynamicAllocation.enabled=false \
    --conf spark.executor.instances=99999

Or, if you're submitting jobs via gcloud beta dataproc jobs, then:

gcloud beta dataproc jobs submit spark \
    --properties spark.dynamicAllocation.enabled=false,spark.executor.instances=99999

If you're really seeing network hiccups or other Dataproc errors disassociating the master/worker when it's not an application-side OOM or something, you can email the Dataproc team directly at dataproc-feedback@google.com; beta would be no excuse for latent broken behavior (though of course we hope to weed out tricky edge-case bugs that we may not have discovered yet during the beta period).
