Apache Flink 中的全局排序 [英] Global sorting in Apache Flink
问题描述
数据集的sortPartition方法根据一些指定的字段在本地对数据集进行排序.如何在 Flink 中以有效的方式对我的大型数据集进行全局排序?
sortPartition method of a dataset sorts the dataset locally based on some specified fields. How can I get my large Dataset sorted globally in an efficient way in Flink?
推荐答案
这目前不太容易实现,因为 Flink 还没有提供内置的范围分区策略.
This is currently not easily possible because Flink does not provide a built-in range partitioning strategy, yet.
解决方法是实现自定义Partitioner
:
A work-around is to implement a custom Partitioner
:
DataSet<Tuple2<Long, Long>> data = ...
data
.partitionCustom(new Partitioner<Long>() {
int partition(Long key, int numPartitions) {
// your implementation
}
}, 0)
.sortPartition(0, Order.ASCENDING)
.writeAsText("/my/output");
注意:为了使用自定义分区器实现平衡分区,您需要了解键的值范围和分布.
Note: In order to achieve balanced partitions with a custom partitioner, you need to know about the value range and distribution of the key.
Apache Flink 中对范围分区器(具有自动采样)的支持目前正在正在进行中 并且应该很快可用.
Support for a range partitioner (with automatic sampling) in Apache Flink is currently work in progress and should be available soon.
编辑(2016 年 6 月 7 日):范围分区已添加到版本 1.0.0 的 Apache Flink.您可以按如下方式对数据集进行全局排序:
Edit (June 7th, 2016): Range partitioning was added to Apache Flink with version 1.0.0. You can globally sort a data set as follows:
DataSet<Tuple2<Long, Long>> data = ...
data
.partitionByRange(0)
.sortPartition(0, Order.ASCENDING)
.writeAsText("/my/output");
请注意,范围分区对输入数据集进行采样,以计算大小相等的分区的数据分布.
Note that range partitioning samples the input data set to compute a data distribution for equally-sized partitions.
这篇关于Apache Flink 中的全局排序的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!