How does Apache Flink deal with skewed data?

Problem description

For example, I have a large stream of words and want to count the occurrences of each word. The problem is that the words are skewed: a few words occur very frequently, while most others are rare. In Storm, we could solve this as follows: first apply shuffle grouping to the stream, then count words locally on each node within a time window, and finally merge the local counts into the cumulative result. From another question of mine, I know that Flink only supports windows on a keyed stream; otherwise the window operation will not run in parallel.

My question is: is there a good way to handle this kind of skewed data in Flink?
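
To make the two-phase idea concrete, here is a minimal plain-Java sketch of it. It uses no Storm or Flink APIs; the class and method names (TwoPhaseWordCount, localCounts, mergeInto) are made up for illustration, and round-robin distribution stands in for shuffle grouping.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; all names here are hypothetical.
class TwoPhaseWordCount {

    // Phase 1: distribute the words of one window round-robin over
    // `parallelism` workers (ignoring the key) and count locally per worker.
    static List<Map<String, Long>> localCounts(List<String> window, int parallelism) {
        List<Map<String, Long>> partials = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            partials.add(new HashMap<>());
        }
        for (int i = 0; i < window.size(); i++) {
            partials.get(i % parallelism).merge(window.get(i), 1L, Long::sum);
        }
        return partials;
    }

    // Phase 2: merge the partial counts into the cumulative result.
    static void mergeInto(Map<String, Long> totals, List<Map<String, Long>> partials) {
        for (Map<String, Long> partial : partials) {
            partial.forEach((word, count) -> totals.merge(word, count, Long::sum));
        }
    }

    public static void main(String[] args) {
        List<String> window = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            window.add("the"); // the hot key
        }
        window.add("flink");
        window.add("storm");

        List<Map<String, Long>> partials = localCounts(window, 4);
        Map<String, Long> totals = new HashMap<>();
        mergeInto(totals, partials);

        // Records for the hot key that reach phase 2: one per worker.
        long hotKeyRecords = partials.stream().filter(p -> p.containsKey("the")).count();
        System.out.println(totals.get("the")); // prints 1000
        System.out.println(hotKeyRecords);     // prints 4
    }
}
```

The point of the sketch is that the node responsible for the hot key receives one partial count per worker and window (4 here) instead of every single occurrence (1000 here).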

Solution

Pre-aggregation is currently not natively supported by the DataStream API. In principle, it is possible to add a combiner-like feature for event-time windows. IMO, this would be a very valuable addition but hasn't been done yet.

However, you can implement this feature yourself. The DataStream API offers a low-level operator interface that is similar to Storm bolts. The interface is called OneInputStreamOperator. This operator type gives you full control. In fact, the built-in operators (such as the window operators) are also based on this interface.

A OneInputStreamOperator can be applied like this:

// MyOISO is a user-defined OneInputStreamOperator<Tuple2<String, Integer>, String>
DataStream<Tuple2<String, Integer>> inStream = ...
DataStream<String> outStream = inStream
  .transform("my op", BasicTypeInfo.STRING_TYPE_INFO, new MyOISO());
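
What goes inside such an operator is up to you. As a rough sketch of the pre-aggregation logic it could hold, here is a plain-Java buffer that is deliberately independent of Flink. PreAggregator, maxBufferedRecords and the downstream callback are hypothetical names, and flushing after a fixed number of input records is just one possible trigger, assumed here for simplicity.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical helper, not part of Flink: buffering logic that a custom
// operator could call from its per-record method.
class PreAggregator {
    private final int maxBufferedRecords;
    private final BiConsumer<String, Long> downstream;
    private final Map<String, Long> buffer = new HashMap<>();
    private int bufferedRecords = 0;

    PreAggregator(int maxBufferedRecords, BiConsumer<String, Long> downstream) {
        this.maxBufferedRecords = maxBufferedRecords;
        this.downstream = downstream;
    }

    // Called once per incoming word: count locally, flush when the buffer is full.
    void processElement(String word) {
        buffer.merge(word, 1L, Long::sum);
        bufferedRecords++;
        if (bufferedRecords >= maxBufferedRecords) {
            flush();
        }
    }

    // Emits one (word, partialCount) pair per buffered key and clears the buffer.
    void flush() {
        buffer.forEach(downstream);
        buffer.clear();
        bufferedRecords = 0;
    }

    public static void main(String[] args) {
        Map<String, Long> totals = new HashMap<>();
        PreAggregator agg = new PreAggregator(100,
                (word, count) -> totals.merge(word, count, Long::sum));
        for (int i = 0; i < 250; i++) {
            agg.processElement("the"); // the hot key
        }
        agg.processElement("flink");
        agg.flush(); // emit whatever is still buffered
        System.out.println(totals.get("the")); // prints 250
    }
}
```

The partial counts emitted by the buffer would then be keyed by word and summed in a regular keyed aggregation, so the hot key receives a few partial counts rather than every record. A real operator would also have to flush on a timer or watermark and before checkpoints, so that buffered counts are neither delayed indefinitely nor lost on failure; that wiring is omitted here.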
