Difference between batch interval, sliding interval and window size in Spark Streaming

Question

I am new to Spark Streaming. I understand that the window size needs to be a multiple of the batch interval. But how does the sliding interval work? If I have a window size of 3 and a sliding interval of 2, wouldn't there be an overlap when I calculate, say, word counts? Or should the sliding interval and the batch interval be the same?

Recommended answer

Here is a link to the documentation.

Let's walk through these concepts:

  1. batch interval - the time, in seconds, for which data is collected before processing is dispatched on it. For example, if you set a batch interval of 5 seconds, Spark Streaming collects data for 5 seconds and then kicks off a computation on an RDD containing that data.
  2. window size - the interval of time, in seconds, of historical data that the RDD contains before processing. For example, with a 1-second batch interval and a window size of 2, a computation is kicked off every second over the 2 most recent batches. E.g. at time=3 you get the data from the batches at time=2 and time=3.
  3. sliding interval - the amount of time, in seconds, by which the window shifts. In the previous example the sliding interval is 1 (since a computation is kicked off every second), e.g. at time=1, time=2, time=3... If you set the sliding interval to 2, you get a computation at time=1, time=3, time=5...
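The three definitions above can be made concrete with a small pure-Python simulation. This is only a sketch, not Spark code: it assumes each batch is labelled by the time at which it ends, that the first batch ends at `t = batch_interval`, and that windows fire at multiples of the sliding interval counted from the start of the stream. In a real job the absolute fire times depend on when the stream starts, so only the spacing between them matters. The function name `window_contents` is made up for illustration.

```python
def window_contents(batch_interval, window_size, slide, end_time):
    """Map each window fire time to the batch times that window covers.

    Assumes window_size and slide are multiples of batch_interval.
    """
    result = {}
    for fire in range(slide, end_time + 1, slide):
        # The window covers the interval (fire - window_size, fire],
        # clipped at the start of the stream.
        start = max(fire - window_size, 0)
        result[fire] = list(range(start + batch_interval, fire + 1, batch_interval))
    return result


# Batch interval 1, window size 2, sliding interval 1 (the example above).
every_second = window_contents(1, 2, 1, end_time=4)

# Batch interval 1, window size 3, sliding interval 2 (the case in the question).
every_two = window_contents(1, 3, 2, end_time=6)

print(every_second)
print(every_two)
```

With a sliding interval of 1, the window at time=3 holds the batches from time=2 and time=3, as described in the list. With a window size of 3 and a sliding interval of 2, consecutive windows share one batch.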

You can refer to the image above, where the window size is 3 times the batch interval and the sliding interval is 2 times the batch interval.
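A configuration like this one (window size three times the batch interval, sliding interval twice the batch interval) could be written in PySpark roughly as follows. This is a hedged sketch, assuming a Spark version that still ships the DStream API (`pyspark.streaming`). The host, port, application name, and the helper names `windowed_word_counts` and `main` are placeholders chosen for illustration, not taken from the original answer.

```python
BATCH_SECONDS = 1    # batch interval
WINDOW_SECONDS = 3   # window size: 3 x batch interval
SLIDE_SECONDS = 2    # sliding interval: 2 x batch interval


def windowed_word_counts(lines):
    """Count words over the last WINDOW_SECONDS, every SLIDE_SECONDS."""
    pairs = lines.flatMap(lambda line: line.split()).map(lambda word: (word, 1))
    # invFunc=None: every window is recomputed from its batches,
    # so no checkpoint directory is required.
    return pairs.reduceByKeyAndWindow(
        lambda a, b: a + b, None, WINDOW_SECONDS, SLIDE_SECONDS
    )


def main():
    # Imported here so this file can be loaded without PySpark installed.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "WindowedWordCount")  # placeholder app name
    ssc = StreamingContext(sc, BATCH_SECONDS)
    lines = ssc.socketTextStream("localhost", 9999)     # placeholder source
    windowed_word_counts(lines).pprint()
    ssc.start()
    ssc.awaitTermination()
```

To try it, call `main()` on a machine with PySpark installed while some process writes lines of text to the chosen port.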

To answer the question of why the window size and sliding interval must be multiples of the batch interval: otherwise your window would end in the middle of a batch.
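The following pure-Python sketch illustrates that constraint. It is not Spark's own validation code: `check_durations` and `batch_cut_by_window` are hypothetical helpers, and batches are assumed to cover the spans (0, b], (b, 2b], and so on, for a batch interval b.

```python
def check_durations(batch_interval, window_size, slide):
    """Raise ValueError unless both durations are multiples of the batch interval."""
    for name, value in (("window size", window_size), ("sliding interval", slide)):
        if value % batch_interval != 0:
            raise ValueError(
                f"{name} {value} is not a multiple of batch interval {batch_interval}"
            )


def batch_cut_by_window(batch_interval, window_size, fire_time):
    """Return the (start, end) span of the batch split by the window's left edge.

    Returns None when the left edge falls exactly on a batch boundary.
    Intended for fire_time >= window_size.
    """
    left_edge = fire_time - window_size
    if left_edge % batch_interval == 0:
        return None
    start = (left_edge // batch_interval) * batch_interval
    return (start, start + batch_interval)


# Batch interval 2, window size 3: the window (1, 4] would need half of batch (0, 2].
cut = batch_cut_by_window(2, 3, fire_time=4)

# Batch interval 2, window size 4: the window (0, 4] lines up with whole batches.
aligned = batch_cut_by_window(2, 4, fire_time=4)
```

Since a batch is the smallest unit of data handed to the computation, a window that needs only part of a batch cannot be assembled from whole batches.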

If you have a window size of 3 and a sliding interval of 2 (see image), then yes, your word counts will overlap. Basically, you use a window when you want to compute something over a limited period of time, such as current news or tweets, and you don't need all the historical data for the analysis.
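The overlap can be seen in a toy word count. The sketch below is plain Python with made-up batch contents, not Spark code. It assumes a batch interval of 1, batches labelled by their end time, and windows that fire at time=2 and time=4 (the absolute times in a real job depend on when the stream starts).

```python
from collections import Counter

# Hypothetical words that arrived in each 1-second batch.
batches = {
    1: ["spark", "stream"],
    2: ["spark"],
    3: ["window"],
    4: ["spark", "window"],
}


def windowed_counts(batches, window_size, fire_time):
    """Count the words in all batches inside (fire_time - window_size, fire_time]."""
    counts = Counter()
    for batch_time, words in batches.items():
        if fire_time - window_size < batch_time <= fire_time:
            counts.update(words)
    return counts


at_2 = windowed_counts(batches, window_size=3, fire_time=2)  # batches 1 and 2
at_4 = windowed_counts(batches, window_size=3, fire_time=4)  # batches 2, 3 and 4

total_spark = sum(words.count("spark") for words in batches.values())
print(at_2["spark"], at_4["spark"], total_spark)
```

The batch at time=2 belongs to both windows, so its "spark" is counted twice across them: the per-window counts add up to more than the number of occurrences in the stream. Each window's result is a count for that period alone, not a running total.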
