Spark job restarted after showing all jobs completed and then fails (TimeoutException: Futures timed out after [300 seconds])


Problem description

I'm running a Spark job. The UI shows that all of the jobs were completed:

However, after a couple of minutes the entire job restarts; this time it again shows all jobs and tasks as completed, but after a couple more minutes it fails. I found this exception in the logs:

java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]

This happens when I'm trying to join two pretty big tables: one of 3B rows and the second of 200M rows. When I run show(100) on the resulting dataframe, everything gets evaluated and I hit this issue.
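For context, the failing code is roughly shaped like this (a minimal sketch only; the dataframe names bigDf/smallDf and the join key id are hypothetical, since the original snippet isn't shown):

// Minimal sketch of the scenario; dataframe names and the join key are hypothetical.
// bigDf has ~3B rows, smallDf has ~200M rows.
val joined = bigDf.join(smallDf, bigDf("id") === smallDf("id"))
joined.show(100)  // forces evaluation of the join; this is where the timeout surfaces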

I tried increasing and decreasing the number of partitions, and I changed the garbage collector to G1 with an increased number of concurrent threads. I also changed spark.sql.broadcastTimeout to 600 (which only changed the timeout message to 600 seconds).
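A sketch of how those settings can be applied with the Spark 1.x SQLContext API (the broadcastTimeout value is from the question; the partition count is just an illustrative value, not the one actually used):

sqlContext.setConf("spark.sql.broadcastTimeout", "600")    // timeout message then reports 600 seconds
sqlContext.setConf("spark.sql.shuffle.partitions", "400")  // example value; higher and lower counts were tried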

I also read that this might be a communication issue; however, other show() calls that run prior to this code segment work without problems, so that's probably not it.

This is the submit command:

/opt/spark/spark-1.4.1-bin-hadoop2.3/bin/spark-submit  --master yarn-cluster --class className --executor-memory 12g --executor-cores 2 --driver-memory 32g --driver-cores 8 --num-executors 40 --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:ConcGCThreads=20" /home/asdf/fileName-assembly-1.0.jar

You can get an idea of the Spark version and the resources used from it.

Where do I go from here? Any help would be appreciated; I can provide code segments and additional logging if needed.

Recommended answer

What eventually solved this was persisting both data frames before the join.
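A minimal sketch of that fix, assuming the same hypothetical dataframes as above (the answer does not show the actual code, and the storage level here is an assumption):

import org.apache.spark.storage.StorageLevel

// Mark both dataframes for caching before the join; MEMORY_AND_DISK is an assumed storage level.
val bigCached   = bigDf.persist(StorageLevel.MEMORY_AND_DISK)
val smallCached = smallDf.persist(StorageLevel.MEMORY_AND_DISK)

val joined = bigCached.join(smallCached, bigCached("id") === smallCached("id"))
joined.show(100)  // now planned as a shuffle join rather than a broadcast join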

I looked at the execution plan before and after persisting the data frames, and the strange thing was that before persisting, Spark tried to perform a BroadcastHashJoin, which clearly failed due to the large size of the data frames; after persisting, the execution plan showed that the join would be a ShuffleHashJoin, which completed without any issues whatsoever. A bug? Maybe; I'll try with a newer Spark version when I get to it.
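The plan change can be checked directly, and the broadcast attempt can also be turned off explicitly; this is an added suggestion, not something the original answer did:

joined.explain()  // prints the physical plan: look for BroadcastHashJoin vs. a shuffle-based join
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")  // -1 disables automatic broadcast joins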
