How are jobs assigned to executors in Spark Streaming?


Question

Let's say I've got 2 or more executors in a Spark Streaming application.

I've set the batch interval to 10 seconds, so a job is started every 10 seconds to read input from my HDFS.

If every job lasts for more than 10 seconds, is the newly started job assigned to a free executor, even if the previous one hasn't finished?

I know it seems like an obvious answer, but I haven't found anything about job scheduling on the website or in the paper related to Spark Streaming.

If you know of any links where all of this is explained, I would really appreciate seeing them.

Thanks.

Answer

Actually, in the current implementation of Spark Streaming and under the default configuration, only one job is active (i.e., under execution) at any point in time. So if one batch's processing takes longer than 10 seconds, the next batch's jobs will stay queued.

This can be changed with the experimental Spark property "spark.streaming.concurrentJobs", which is set to 1 by default. It's not currently documented (maybe I should add it).
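As a sketch, such a property would be set like any other Spark configuration key when submitting the application (the application class and jar name here are hypothetical placeholders):

```shell
# Allow up to 2 streaming batch jobs to run concurrently.
# Experimental property; see the caveats below before enabling it.
spark-submit \
  --conf spark.streaming.concurrentJobs=2 \
  --class com.example.MyStreamingApp \
  my-streaming-app.jar
```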

The reason it is set to 1 is that concurrent jobs can potentially lead to weird sharing of resources, which can make it hard to debug whether there are sufficient resources in the system to process the ingested data fast enough. With only one job running at a time, it is easy to see that if the batch processing time is less than the batch interval, the system will be stable. Granted, this may not be the most efficient use of resources under certain conditions. We definitely hope to improve this in the future.
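To make that stability condition concrete, here is a small toy simulation (plain Python, not Spark code; all names are illustrative) of a FIFO scheduler with a fixed number of concurrent job slots:

```python
def completion_times(batch_interval, processing_time, num_batches, concurrent_jobs=1):
    """Return the completion time of each batch under a simple FIFO
    scheduler with `concurrent_jobs` execution slots."""
    slots = [0.0] * concurrent_jobs  # time at which each slot becomes free
    done = []
    for i in range(num_batches):
        arrival = i * batch_interval           # batch i arrives every interval
        slot = min(range(concurrent_jobs), key=lambda s: slots[s])
        start = max(arrival, slots[slot])      # wait for the earliest free slot
        slots[slot] = start + processing_time
        done.append(slots[slot])
    return done

# Stable: processing (8s) < interval (10s) -> delay stays bounded at 8s per batch.
stable = completion_times(10, 8, 5)
# Unstable: processing (15s) > interval (10s) -> delay grows: 15, 20, 25, 30, 35s.
unstable = completion_times(10, 15, 5)
# A second slot (as with concurrentJobs=2) keeps the delay bounded at 15s again.
two_slots = completion_times(10, 15, 5, concurrent_jobs=2)
```

With a single slot and processing time above the interval, each batch finishes 5 seconds later relative to its arrival than the previous one, which is exactly the unbounded queueing the answer describes.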

There is a little bit of material regarding the internals of Spark Streaming in these meetup slides (sorry about the shameless self-advertising :) ). That may be useful to you.

