Spark job restarted after showing all jobs completed and then fails (TimeoutException: Futures timed out after [300 seconds])


Problem Description

I'm running a Spark job. It shows that all of the jobs were completed.

However, after a couple of minutes the entire job restarts; that time it shows all jobs and tasks as completed too, but after a couple of minutes it fails. I found this exception in the logs:

java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]

This happens when I'm trying to join two pretty big tables: one of 3B rows and the other of 200M rows. When I run show(100) on the resulting dataframe, everything gets evaluated and I hit this issue.
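For context, a minimal sketch of the kind of join involved (the table names, column names, and the user_id join key are hypothetical placeholders, using the Spark 1.4 Scala DataFrame API):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("bigJoin"))
val sqlContext = new HiveContext(sc)

// Two large inputs (placeholder names): ~3B rows and ~200M rows
val big   = sqlContext.table("events")
val small = sqlContext.table("users")

// The join itself is lazy; show(100) forces the whole plan to be evaluated
val joined = big.join(small, big("user_id") === small("user_id"))
joined.show(100)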

I tried playing around with increasing/decreasing the number of partitions, and I changed the garbage collector to G1 with an increased number of threads. I changed spark.sql.broadcastTimeout to 600 (which made the timeout message change to 600 seconds).
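For reference, a sketch of how that timeout can be raised (setConf takes the value in seconds, as a string; sqlContext is the SQLContext/HiveContext from the snippet above):

// Raise the broadcast-join timeout from the 300-second default
sqlContext.setConf("spark.sql.broadcastTimeout", "600")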

I also read that this might be a communication issue; however, other show() clauses that run prior to this code segment work without problems, so that's probably not it.

This is the submit command:

/opt/spark/spark-1.4.1-bin-hadoop2.3/bin/spark-submit \
  --master yarn-cluster \
  --class className \
  --executor-memory 12g \
  --executor-cores 2 \
  --driver-memory 32g \
  --driver-cores 8 \
  --num-executors 40 \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:ConcGCThreads=20" \
  /home/asdf/fileName-assembly-1.0.jar

You can get an idea of the Spark version and the resources used from there.

Where do I go from here? Any help would be appreciated, and code segments/additional logging will be provided if needed.

Answer

What eventually solved this was persisting both data frames before the join.

I looked at the execution plan before and after persisting the data frames, and the strange thing was that before persisting, Spark tried to perform a BroadcastHashJoin, which clearly failed due to the large size of the data frames; after persisting, the execution plan showed that the join would be a ShuffleHashJoin, which completed without any issues whatsoever. A bug? Maybe; I'll try with a newer Spark version when I get to it.
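A minimal sketch of that fix, under the same placeholder assumptions as the earlier snippet (persist() and explain() are standard DataFrame methods in Spark 1.4):

import org.apache.spark.storage.StorageLevel

// Materialize both sides before joining; after persisting, the physical
// plan showed a ShuffleHashJoin instead of the failing BroadcastHashJoin
val bigP   = big.persist(StorageLevel.MEMORY_AND_DISK)
val smallP = small.persist(StorageLevel.MEMORY_AND_DISK)

val joined = bigP.join(smallP, bigP("user_id") === smallP("user_id"))

joined.explain()   // inspect the physical plan before triggering the job
joined.show(100)

Another way to steer the planner away from broadcasting is setting spark.sql.autoBroadcastJoinThreshold to -1, which disables automatic broadcast joins entirely.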

