Apache Flink DataStream API 没有 mapPartition 转换 [英] Apache Flink DataStream API doesn't have a mapPartition transformation

查看:24
本文介绍了Apache Flink DataStream API 没有 mapPartition 转换的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

Spark DStream 有 mapPartition API,而 Flink DataStream API 没有.有没有人可以帮忙解释一下原因.我想做的是在Flink上实现一个类似于Spark reduceByKey 的API.

Spark DStream has mapPartition API, while Flink DataStream API doesn't. Is there anyone who could help explain the reason. What I want to do is to implement a API similar to Spark reduceByKey on Flink.

推荐答案

Flink 的流处理模型与以小批量为中心的 Spark Streaming 有很大不同.在 Spark Streaming 中,每个 minibatch 都像对有限数据集的常规批处理程序一样执行,而 Flink DataStream 程序则连续处理记录.

Flink's stream processing model is quite different from Spark Streaming which is centered around mini batches. In Spark Streaming each mini batch is executed like a regular batch program on a finite set of data, whereas Flink DataStream programs continuously process records.

在 Flink 的 DataSet API 中,一个 MapPartitionFunction 有两个参数.输入的迭代器和函数结果的收集器.Flink DataStream 程序中的 MapPartitionFunction 永远不会从第一次函数调用中返回,因为迭代器会遍历无穷无尽的记录流.但是,Flink 的内部流处理模型要求用户函数返回,以便检查点函数状态.因此,DataStream API 不提供 mapPartition 转换.

In Flink's DataSet API, a MapPartitionFunction has two parameters. An iterator for the input and a collector for the result of the function. A MapPartitionFunction in a Flink DataStream program would never return from the first function call, because the iterator would iterate over an endless stream of records. However, Flink's internal stream processing model requires that user functions return in order to checkpoint function state. Therefore, the DataStream API does not offer a mapPartition transformation.

为了实现类似于 Spark Streaming 的 reduceByKey 的功能,您需要在流上定义一个键控窗口.Windows 离散流,这有点类似于小批量,但 Windows 提供了更多的灵活性.由于窗口大小有限,您可以调用reduce窗口.

In order to implement functionality similar to Spark Streaming's reduceByKey, you need to define a keyed window over the stream. Windows discretize streams which is somewhat similar to mini batches but windows offer way more flexibility. Since a window is of finite size, you can call reduce the window.

这可能看起来像:

yourStream.keyBy("myKey") // organize stream by key "myKey"
          .timeWindow(Time.seconds(5)) // build 5 sec tumbling windows
          .reduce(new YourReduceFunction); // apply a reduce function on each window

DataStream 文档展示了如何定义各种窗口类型并解释所有可用功能.

The DataStream documentation shows how to define various window types and explains all available functions.

注意:最近重新设计了 DataStream API.该示例假设最新版本 (0.10-SNAPSHOT) 将在未来几天发布为 0.10.0.

Note: The DataStream API has been reworked recently. The example assumes the latest version (0.10-SNAPSHOT) which will be release as 0.10.0 in the next days.

这篇关于Apache Flink DataStream API 没有 mapPartition 转换的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆