Flink:汇总所有分区结果的最佳方法是什么 [英] Flink: What is the best way to summarize the result from all partitions

查看:432
本文介绍了Flink:汇总所有分区结果的最佳方法是什么的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

数据流被分区并分配到每个插槽以进行处理.现在,我可以获得每个分区任务的结果.将某些功能应用于不同分区的结果并获得全局汇总结果的最佳方法是什么?

The datastream is partitioned and distributed to each slot for processing. Now I can get the result of each partitioned task. What is the best approach to apply some function to those result of different partitions and get a global summary result?

已更新: 我想实现一些数据汇总算法,例如Flink中的Misra-Gries.它将维护k个计数器,并随着数据的到达进行更新.因为数据可能具有较大的可伸缩性,所以每个分区最好有自己的k个计数器并并行处理.最后将这些计数器合并到最后的k个计数器以显示结果.最好的组合方式是什么?

Updated: I want to implement some data summary algorithm such as Misra-Gries in Flink. It will maintain k counters and update with data arriving. Because data may be large scalable, It's better that each partition has its own k counters and process parallel. Finally merge those counters to final k counters to present the result. What is the best way to do combination?

推荐答案

Flink的内置聚合功能(如reducesummax)建立在Flink的托管键控状态机制的基础上,并且可以仅适用于KeyedStream.但是,您可以使用

Flink's built-in aggregation functions, like reduce, sum, and max are built on top of Flink's managed keyed state mechanism, and can only be applied to a KeyedStream. What you can do, however, is use either WindowAll or ProcessFunction. Here is an example:

parallelStream
  .process(new MyProcessFunction())
  .setParallelism(1)
  .print()
  .setParallelism(1);

请注意,所有的初步处理都是在默认的并行度下完成的,然后依次应用处理功能和打印.

Note that all of the preliminary processing is being done at the default parallelism, and then the process function and print are being applied serially.

ProcessFunction应该将其状态保持在

The ProcessFunction should keep its state in managed operator (non-keyed) state in order to be fault tolerant.

这将在整个输入上产生不断更新的摘要流.如果您希望在Windows上生成摘要,请使用countWindowAlltimeWindowAll之类的东西.

This will produce a continuously updated stream of summaries over the entire input. Use something like countWindowAll or timeWindowAll if you prefer to produce summaries over windows.

这篇关于Flink:汇总所有分区结果的最佳方法是什么的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆