Google Dataproc timing out and killing executors

Problem description

I have a Google Dataproc Spark cluster set up with one master node and 16 worker nodes. The master has 2 CPUs and 13 GB of memory, and each worker has 2 CPUs and 3.5 GB of memory. I am running a rather network-intensive job: I have an array of 16 objects, and I split it into 16 partitions so that each worker gets one object. The objects make about 2.5 million web requests and aggregate the results to send back to the master. Each request returns a Solr response of less than 50k. One field (an ID, as a string) is extracted from each response and added to a list that is sent back to the master. The whole process should finish in about 1-2 hours.

However, at some point during execution, I keep getting an error where the master loses an executor's heartbeat and kills it. The master receives no details beyond the timeout, and the worker's log shows that it was running as normal. I tried installing Stackdriver Monitoring to see whether this is a RAM problem, but the agent latency is over an hour when it should be at most 2 minutes, so I do not have any up-to-date memory information.

Does anyone have an idea why this is happening? My guesses are that the job is flooding the network ports so the instance can't send out heartbeats or instance metrics, that it is a RAM issue (I get the same error for pretty much every RAM value I try), or that there is some issue on Google's side.

Solution

Thanks to the comment by @Dennis, I managed to find that an OOM exception was being thrown by the executor that was being killed. I never saw it before because the error was only written to standard output rather than to any of the error logs, where one would expect it.
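
For anyone hitting the same symptom, here is a minimal sketch of scanning a saved copy of an executor's standard output for the error. It assumes the stdout text has already been retrieved somehow (where it is stored depends on the cluster setup and is not covered here). The function name `find_oom_lines` and the sample lines are hypothetical and made up for illustration; only the `java.lang.OutOfMemoryError` class name is the standard JVM one.

```python
import re
from typing import Iterable, List, Tuple

# Standard JVM error class name, e.g.
# "java.lang.OutOfMemoryError: Java heap space".
OOM_PATTERN = re.compile(r"java\.lang\.OutOfMemoryError")


def find_oom_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (1-based line number, stripped text) for each line mentioning an OOM error."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if OOM_PATTERN.search(line):
            hits.append((number, line.strip()))
    return hits


# Hypothetical stdout capture, invented for this example only.
sample_stdout = [
    "INFO fetching batch 1041",
    "java.lang.OutOfMemoryError: Java heap space",
    "INFO fetching batch 1042",
]

for number, text in find_oom_lines(sample_stdout):
    print(f"line {number}: {text}")
```

The same function works on an open file handle, since iterating over a text file yields its lines, so a large stdout dump does not need to be loaded into memory at once.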
