How does Apache Flink deal with skewed data?

Question

For example, I have a large stream of words and want to count the occurrences of each word. The problem is that the words are skewed: a few words occur very frequently, while most others occur rarely. In Storm, we could solve this as follows: first apply shuffle grouping to the stream, then have each node count words locally within a time window, and finally merge those local counts into the cumulative result. From another question of mine, I know that Flink only supports windows on a keyed stream; otherwise the window operation will not run in parallel.
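The Storm-style approach described here can be illustrated with a small, dependency-free simulation. This is only a sketch of the idea, not Storm or Flink code: the class `TwoStageWordCount`, its method names, and the round-robin distribution (standing in for shuffle grouping) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of two-stage counting; all names are hypothetical.
class TwoStageWordCount {

    // Stage 1: distribute words round-robin (a stand-in for shuffle grouping)
    // and let each worker count only the words it received.
    static List<Map<String, Integer>> localCounts(List<String> words, int workers) {
        List<Map<String, Integer>> partials = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            partials.add(new HashMap<>());
        }
        for (int i = 0; i < words.size(); i++) {
            partials.get(i % workers).merge(words.get(i), 1, Integer::sum);
        }
        return partials;
    }

    // Stage 2: merge the per-worker partial counts into the cumulative result.
    static Map<String, Integer> mergeCounts(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, count) -> total.merge(word, count, Integer::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        // "the" dominates the input, yet each worker still sees half the records.
        List<String> words = Arrays.asList(
                "the", "the", "the", "the", "flink", "the", "storm", "the");
        List<Map<String, Integer>> partials = localCounts(words, 2);
        System.out.println("partial counts: " + partials);
        System.out.println("merged counts:  " + mergeCounts(partials));
    }
}
```

Because stage 1 ignores the key, the hot word is spread evenly over the workers, and stage 2 only has to merge at most one partial count per word per worker instead of every single record.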

My question is: is there a good way to handle this kind of data skew in Flink?

Recommended answer

Pre-aggregation is currently not natively supported by the DataStream API. In principle, a combiner-like feature could be added for event-time windows. IMO, this would be a very valuable addition, but it hasn't been done yet.

However, you can implement this feature yourself. The DataStream API offers a low-level operator interface that is similar to Storm bolts. The interface is called OneInputStreamOperator, and it gives you full control over how each record is processed. In fact, the built-in operators (such as the window operators) are also based on this interface.

A OneInputStreamOperator can be applied like this:

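// MyOISO is a user-defined OneInputStreamOperator<Tuple2<String, Integer>, String>
DataStream<Tuple2<String, Integer>> inStream = ...
DataStream<String> outStream = inStream
  // arguments: operator name, type information of the output, operator instance
  .transform("my op", BasicTypeInfo.STRING_TYPE_INFO, new MyOISO());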
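What such a custom operator might do internally for pre-aggregation can be sketched without any Flink dependency. The following is a minimal sketch under stated assumptions, not Flink API: `PreAggregator`, the `maxKeys` flush threshold, and the `BiConsumer` that stands in for the operator's output are all hypothetical. In an actual OneInputStreamOperator, `add` would presumably be called from `processElement`, and the operator would also have to flush on watermarks and take care of checkpointing, both of which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

// Stand-in for the buffering logic of a pre-aggregating operator; not Flink API.
class PreAggregator {
    private final Map<String, Integer> buffer = new HashMap<>();
    private final int maxKeys;                      // hypothetical flush threshold
    private final BiConsumer<String, Integer> out;  // stand-in for the operator's output

    PreAggregator(int maxKeys, BiConsumer<String, Integer> out) {
        this.maxKeys = maxKeys;
        this.out = out;
    }

    // Fold one record into the local partial count instead of forwarding it.
    void add(String word, int count) {
        buffer.merge(word, count, Integer::sum);
        if (buffer.size() >= maxKeys) {
            flush();
        }
    }

    // Emit one partial count per buffered word, then start over.
    void flush() {
        buffer.forEach(out);
        buffer.clear();
    }

    public static void main(String[] args) {
        List<String> emitted = new ArrayList<>();
        PreAggregator agg = new PreAggregator(2, (w, c) -> emitted.add(w + "=" + c));
        for (String w : new String[] {"the", "the", "the", "flink"}) {
            agg.add(w, 1);
        }
        agg.flush(); // nothing left to emit here; shown for completeness
        System.out.println(emitted);
    }
}
```

The partial counts emitted this way can then be keyed by word and summed downstream, so a hot word reaches the keyed operator as a few partial results rather than as every individual record.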
