Can I use a custom partitioner with group by?

Views: 141 | Posted: 2020/11/8 21:09:49 | Tags: apache-flink, flink-streaming

Question

Let's say that I know that my dataset is unbalanced and I know the distribution of the keys. I'd like to leverage this to write a custom partitioner to get the most out of the operator instances.

I know about DataStream#partitionCustom. However, if my stream is keyed, will it still work properly? My job would look something like:

DataStream<T> afterCustomPartition = keyedStream.partitionCustom(new MyPartitioner(), new MyPartitionKeySelector());

DataStreamUtils.reinterpretAsKeyedStream(afterCustomPartition, new MyGroupByKeySelector<>()).sum()

What I want to achieve is:

Having a stream keyed (keyBy) according to some key, so that the reduce function will only be called with elements from that key.
Having the group by split the work across nodes based on some custom partitioning.
Having the custom partitioning return a number based on the number of parallel operator instances (which will be fixed and not subject to rescaling).
Having the custom partitioning return different values from the keyBy. However, keyBy(x) = keyBy(y) => partition(x) = partition(y).
Having pre-aggregation to minimize network traffic before partitioning.
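To make the last two goals concrete, here is a minimal sketch in plain Java (it does not use the Flink API). It assumes two hypothetical upstream subtasks, a count per key as the aggregation, and a `partition` function that depends only on the group-by key, so that keyBy(x) = keyBy(y) => partition(x) = partition(y) holds by construction. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PreAggregationSketch {
    // Hypothetical custom partitioning: a pure function of the group-by key,
    // so two elements with the same key always map to the same instance.
    static int partition(int groupKey) {
        return groupKey == 0 ? 0 : 1;
    }

    // Local pre-aggregation: one partial count per key instead of one record per element.
    static Map<Integer, Long> preAggregate(int[] keys) {
        Map<Integer, Long> partial = new TreeMap<>();
        for (int key : keys) {
            partial.merge(key, 1L, Long::sum);
        }
        return partial;
    }

    // Route each partial count to its target instance and merge it there.
    static List<Map<Integer, Long>> shuffleAndMerge(List<Map<Integer, Long>> partials, int numInstances) {
        List<Map<Integer, Long>> instances = new ArrayList<>();
        for (int i = 0; i < numInstances; i++) {
            instances.add(new TreeMap<>());
        }
        for (Map<Integer, Long> partial : partials) {
            for (Map.Entry<Integer, Long> e : partial.entrySet()) {
                instances.get(partition(e.getKey())).merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return instances;
    }

    public static void main(String[] args) {
        // Assumed split of the input across two upstream subtasks (illustrative only).
        List<Map<Integer, Long>> partials = List.of(
                preAggregate(new int[] {0, 0, 1}),
                preAggregate(new int[] {0, 2}));
        System.out.println(shuffleAndMerge(partials, 2)); // prints [{0=3}, {1=1, 2=1}]
    }
}
```

In this sketch four partial records cross the network instead of five raw elements. The saving grows with the number of repeated keys per upstream subtask.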

Example use case:

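Dataset: [(0, A), (0, B), (0, C), (1, D), (2, E)]
Number of parallel operator instances: 2
Group by function: returns the first element of the pair
Partition function: returns 0 for key 0 and 1 for keys 1 and 2. Advantage: it handles data skew that might otherwise send keys 0 and 1 to the same operator instance, which would mean one operator instance receives 80% of the dataset.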
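The load figures in this example can be checked with a short plain-Java sketch (again, not Flink code). `customPartition` is the partition function described above. `unluckyPartition` is a hypothetical stand-in for an assignment that happens to place keys 0 and 1 on the same instance. It is not what Flink's key-group hashing would necessarily produce.

```java
import java.util.Arrays;
import java.util.function.IntUnaryOperator;

class SkewDemo {
    // Keys of the example dataset [(0, A), (0, B), (0, C), (1, D), (2, E)].
    static final int[] KEYS = {0, 0, 0, 1, 2};

    // Partition function from the example: key 0 -> instance 0, keys 1 and 2 -> instance 1.
    static int customPartition(int key) {
        return key == 0 ? 0 : 1;
    }

    // Hypothetical unlucky assignment: keys 0 and 1 collide on instance 0.
    static int unluckyPartition(int key) {
        return key == 2 ? 1 : 0;
    }

    // Count how many elements each operator instance would receive.
    static int[] load(IntUnaryOperator partition, int numInstances) {
        int[] counts = new int[numInstances];
        for (int key : KEYS) {
            counts[partition.applyAsInt(key)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(load(SkewDemo::customPartition, 2)));  // prints [3, 2]
        System.out.println(Arrays.toString(load(SkewDemo::unluckyPartition, 2))); // prints [4, 1]
    }
}
```

With the custom function the two instances receive 3 and 2 elements. With the colliding assignment one instance receives 4 of 5 elements, which is the 80% case mentioned above.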
