Differences between working with states and windows (time) in Flink streaming


Question

Let's say we want to compute the sum and average of the items, and can work either with state or with (time) windows.

Example working with windows - https://ci.apache.org/projects/flink/flink-docs-release-0.10/apis/streaming_guide.html#example-program

Example working with states - https://github.com/dataArtisans/flink-training-exercises/blob/master/src/main/java/com/dataartisans/flinktraining/exercises/datastream_java/ride_speed/RideSpeed.java

What would be the reasons for choosing one approach over the other? Can I infer that if the data arrives very irregularly (50% arrives within the defined window length and the other 50% doesn't), the result of the window approach is more biased (because that 50% of the events is dropped)?

On the other hand, do we spend more time checking and updating the states when working with states?

Answer

First, it depends on your semantics. The two examples use different semantics and are thus not directly comparable. Furthermore, windows work with state internally, too. It is hard to say in general which approach is the better one.

As Flink's window semantics are very rich, I would suggest using windows. If you cannot express your semantics with windows, using state can be a good alternative. Using windows has the additional advantage that state handling, which is hard to get right, is done automatically for you.
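To make "handling state yourself" concrete, here is a minimal plain-Java sketch of a per-key running sum and average. It does not use the Flink API; the class `KeyedRunningStats` and its methods are hypothetical names chosen for illustration. It shows only the bookkeeping you would own when working with state directly. In a real job this state would also have to be made fault tolerant and cleaned up at some point, which the sketch ignores.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration (not Flink API): hand-managed per-key running sum and count. */
final class KeyedRunningStats {
    // key -> {sum, count}; the average is derived from these two values
    private final Map<String, double[]> state = new HashMap<>();

    /** Updates the state for the key and returns the current average for that key. */
    double add(String key, double value) {
        double[] acc = state.computeIfAbsent(key, k -> new double[2]);
        acc[0] += value; // running sum
        acc[1] += 1;     // running count
        return acc[0] / acc[1];
    }

    /** Returns the running sum for the key, or 0.0 if the key has not been seen. */
    double sum(String key) {
        double[] acc = state.get(key);
        return acc == null ? 0.0 : acc[0];
    }

    public static void main(String[] args) {
        KeyedRunningStats stats = new KeyedRunningStats();
        stats.add("a", 2.0);
        double avgA = stats.add("a", 4.0);
        System.out.println("sum(a)=" + stats.sum("a") + ", avg(a)=" + avgA);
    }
}
```

Note that this state only ever grows: nothing decides when a key's sum is "finished". A window supplies exactly that boundary, which is one reason the two approaches have different semantics.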

The decision is definitely independent of your data arrival rate. Flink does not drop any data. If you work with event time (rather than with processing time), your result will be the same regardless of the data arrival rate.
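The event-time point can be illustrated with a small plain-Java simulation. This is a sketch under assumptions, not Flink code: `EventTimeWindows` and `sumPerWindow` are hypothetical names, events are modeled as `{timestampMs, value}` pairs with non-negative timestamps, and watermarks and late-data handling are not modeled. Each event goes into a tumbling window chosen from its own timestamp, so the order and pace of arrival do not change the per-window sums.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical illustration (not Flink API): tumbling windows assigned by event timestamp. */
final class EventTimeWindows {

    /**
     * Sums values per tumbling window. Each event is {timestampMs, value};
     * the window is identified by its start: timestamp - (timestamp % windowSizeMs).
     */
    static TreeMap<Long, Long> sumPerWindow(List<long[]> events, long windowSizeMs) {
        TreeMap<Long, Long> sums = new TreeMap<>();
        for (long[] e : events) {
            long windowStart = e[0] - (e[0] % windowSizeMs);
            sums.merge(windowStart, e[1], Long::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        // Same events, two different arrival orders
        List<long[]> inOrder = Arrays.asList(
                new long[]{1000, 5}, new long[]{4000, 7}, new long[]{6000, 1});
        List<long[]> outOfOrder = Arrays.asList(
                new long[]{6000, 1}, new long[]{1000, 5}, new long[]{4000, 7});

        TreeMap<Long, Long> a = sumPerWindow(inOrder, 5000);
        TreeMap<Long, Long> b = sumPerWindow(outOfOrder, 5000);
        System.out.println(a + " equals " + b + ": " + a.equals(b));
    }
}
```

With processing time, by contrast, the window would be chosen from the wall clock at the moment of arrival, so the same events could land in different windows from run to run.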
