Spark: long delay between jobs


Problem Description

So we are running a Spark job that extracts data, does some extensive data conversion, and writes to several different files. Everything runs fine, but I'm getting random, long delays between a resource-intensive job finishing and the next job starting.

In the picture below, we can see that the job scheduled at 17:22:02 took 15 minutes to finish, which means I would expect the next job to be scheduled around 17:37:02. However, the next job was scheduled at 22:05:59, more than 4 hours after the job succeeded.

When I dig into the next job's Spark UI, it shows a scheduler delay of <1 sec. So I'm confused about where this 4-hour delay is coming from.

(Spark 1.6.1 with Hadoop 2)

Updated:

I can confirm that David's answer below is spot on: the way I/O ops are handled in Spark is a bit unexpected. (It makes sense that a file write essentially does a "collect" behind the curtain before it writes, considering ordering and/or other operations.) But I'm a bit discomforted by the fact that I/O time is not included in job execution time. I guess you can see it in the "SQL" tab of the Spark UI, since queries are still shown as running even after all jobs have succeeded, but you cannot dive into it at all.

I'm sure there are more ways to improve this, but the two methods below were sufficient for me (a sketch of both follows the list):

  1. Reduce the file count
  2. Set parquet.enable.summary-metadata to false
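
A minimal sketch of what those two changes might look like in a Spark 1.6 Scala job; the input/output paths and the coalesce factor of 32 are placeholders, not values from the original job:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("parquet-write"))
    val sqlContext = new SQLContext(sc)

    // 2. Skip the _metadata/_common_metadata summary files, which the
    //    driver otherwise builds by re-scanning every part file on commit.
    sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

    val df = sqlContext.read.parquet("/path/to/input")  // placeholder path

    // 1. Fewer output files means fewer files for the driver to move
    //    and scan after the job's tasks finish.
    df.coalesce(32).write.parquet("/path/to/output")    // placeholder path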

Recommended Answer

I/O operations often come with significant overhead that occurs on the master node. Since this work isn't parallelized, it can take quite a bit of time. And since it is not a job, it does not show up in the resource manager UI. Some examples of I/O tasks that are done by the master node:

  • Spark will write to a temporary s3 directory, then move the files using the master node
  • Reading of text files often happens on the master node
  • When writing parquet files, the master node will scan all the files post-write to check the schema

These issues can be solved by tweaking YARN settings or redesigning your code. If you provide some source code, I might be able to pinpoint your issue.
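
As one hedged illustration of the settings-tweak route (my own example, not something this answer prescribes): the driver-side move of files out of the temporary output directory can be reduced by switching the Hadoop FileOutputCommitter to its version-2 algorithm, which commits task output during task commit instead of renaming everything in a final single-threaded step:

    // Available from Hadoop 2.7; cuts the final driver-side rename pass.
    // On s3-like stores this trades some failure safety for speed, so
    // treat it as an option to evaluate, not a drop-in fix.
    sc.hadoopConfiguration.set(
      "mapreduce.fileoutputcommitter.algorithm.version", "2")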

Discussion of write I/O overhead with Parquet and s3

Discussion of read I/O overhead: "s3 is not a filesystem"
