How are jobs assigned to executors in Spark Streaming?


Question

Let's say I've got 2 or more executors in a Spark Streaming application.

I've set a batch interval of 10 seconds, so a job starts every 10 seconds, reading input from my HDFS.

If every job lasts for more than 10 seconds, is the newly started job assigned to a free executor?

Even if the previous one hasn't finished?

I know it seems like an obvious answer, but I haven't found anything about job scheduling on the website or in the paper related to Spark Streaming.

If you know of some links where all of this is explained, I would really appreciate seeing them.

Thanks.

Answer

Actually, in the current implementation of Spark Streaming and under the default configuration, only one job is active (i.e. under execution) at any point in time. So if one batch's processing takes longer than 10 seconds, then the next batch's jobs will stay queued.

This can be changed with an experimental Spark property, "spark.streaming.concurrentJobs", which is set to 1 by default. It's not currently documented (maybe I should add it).
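For reference, a sketch of how this experimental property could be passed at submit time (the value 4, the class name, and the jar name here are purely illustrative placeholders, not recommendations):

```shell
# Hedged sketch: setting the experimental property via spark-submit.
# "4" is an arbitrary example value; class and jar names are hypothetical.
spark-submit \
  --conf spark.streaming.concurrentJobs=4 \
  --class my.streaming.App \
  my-streaming-app.jar
```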

The reason it is set to 1 is that concurrent jobs can potentially lead to weird sharing of resources, which can make it hard to debug whether there are sufficient resources in the system to process the ingested data fast enough. With only 1 job running at a time, it is easy to see that if batch processing time < batch interval, then the system will be stable. Granted, this may not be the most efficient use of resources under certain conditions. We definitely hope to improve this in the future.
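The stability condition above (batch processing time < batch interval) can be illustrated with a tiny, Spark-free toy model of the default one-job-at-a-time scheduler (this is just a sketch to show the queueing behavior, not actual Spark code):

```python
def simulate(batch_interval, processing_time, num_batches):
    """Toy model: batches arrive every `batch_interval` seconds,
    but only one job executes at a time (the default behavior)."""
    worker_free_at = 0.0
    delays = []
    for i in range(num_batches):
        arrival = i * batch_interval
        start = max(arrival, worker_free_at)  # wait for the previous job
        worker_free_at = start + processing_time
        delays.append(start - arrival)        # scheduling delay per batch
    return delays

# Stable: processing is faster than the interval -> no queueing delay.
print(simulate(batch_interval=10, processing_time=8, num_batches=5))
# -> [0.0, 0.0, 0.0, 0.0, 0.0]

# Unstable: processing is slower than the interval -> delay grows forever.
print(simulate(batch_interval=10, processing_time=12, num_batches=5))
# -> [0.0, 2.0, 4.0, 6.0, 8.0]
```

When processing time exceeds the interval, each batch inherits the previous one's overrun, so the scheduling delay grows without bound; this is exactly why keeping one job at a time makes the stability check easy to reason about.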

There is a little bit of material regarding the internals of Spark Streaming in these meetup slides (sorry about the shameless self-advertising :) ). That may be useful to you.
