Spark: long delay between jobs


Problem Description

So we are running a Spark job that extracts data, does some extensive data conversion, and writes to several different files. Everything runs fine, but I'm getting random, long delays between a resource-intensive job finishing and the next job starting.

In the picture below, we can see that the job scheduled at 17:22:02 took 15 minutes to finish, which means I would expect the next job to be scheduled around 17:37:02. However, the next job was scheduled at 22:05:59, more than 4 hours after the job succeeded.

When I dig into the next job's Spark UI, it shows a scheduler delay of <1 sec. So I'm confused about where this 4-hour delay is coming from.

(Spark 1.6.1 with Hadoop 2)

Updated:

I can confirm that David's answer below is spot on: the way I/O ops are handled in Spark is a bit unexpected. (It makes sense that a file write essentially does a "collect" behind the curtain before it writes, considering ordering and/or other operations.) But I'm a bit discomforted by the fact that I/O time is not included in job execution time. I guess you can see it in the "SQL" tab of the Spark UI, since queries are still shown as running even after all jobs have succeeded, but you cannot dive into it at all.

I'm sure there are more ways to improve this, but the two methods below were sufficient for me (a sketch of both follows the list):

  1. Reduce the file count
  2. Set parquet.enable.summary-metadata to false
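
A minimal sketch of what those two changes might look like in a Spark 1.6 Scala job; the input/output paths and the coalesce factor of 32 are placeholders, not values from the original job:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("parquet-write"))
    val sqlContext = new SQLContext(sc)

    // 2. Skip the _metadata/_common_metadata summary files, which the
    //    driver otherwise builds by re-scanning every part file on commit.
    sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

    val df = sqlContext.read.parquet("/path/to/input")  // placeholder path

    // 1. Fewer output files means fewer files for the driver to move
    //    and scan after the job's tasks finish.
    df.coalesce(32).write.parquet("/path/to/output")    // placeholder path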

Recommended Answer

I/O operations often come with significant overhead that occurs on the master node. Since this work isn't parallelized, it can take quite a bit of time. And since it is not a job, it does not show up in the resource manager UI. Some examples of I/O tasks that are done by the master node:

  • Spark will write to a temporary s3 directory, then move the files using the master node
  • Reading of text files often happens on the master node
  • When writing parquet files, the master node will scan all the files post-write to check the schema

These issues can be solved by tweaking YARN settings or redesigning your code. If you provide some source code, I might be able to pinpoint your issue.
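
As one hedged illustration of the settings-tweak route (my own example, not something this answer prescribes): the driver-side move of files out of the temporary output directory can be reduced by switching the Hadoop FileOutputCommitter to its version-2 algorithm, which commits task output during task commit instead of renaming everything in a final single-threaded step:

    // Available from Hadoop 2.7; cuts the final driver-side rename pass.
    // On s3-like stores this trades some failure safety for speed, so
    // treat it as an option to evaluate, not a drop-in fix.
    sc.hadoopConfiguration.set(
      "mapreduce.fileoutputcommitter.algorithm.version", "2")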

Discussion of write I/O overhead with Parquet and s3

Discussion of read I/O overhead: "s3 is not a filesystem"
