Apache Flink中的全局排序 [英] Global sorting in Apache Flink
问题描述
sortPartition方法基于一些指定的字段在本地对数据集进行排序.如何在Flink中高效地对大型数据集进行全局排序?
sortPartition method of a dataset sorts the dataset locally based on some specified fields. How can I get my large Dataset sorted globally in an efficient way in Flink?
推荐答案
由于Flink尚不提供内置的范围分区策略,因此目前尚不容易实现.
This is currently not easily possible because Flink does not provide a built-in range partitioning strategy, yet.
一种解决方法是实施自定义Partitioner
:
A work-around is to implement a custom Partitioner
:
DataSet<Tuple2<Long, Long>> data = ...
data
.partitionCustom(new Partitioner<Long>() {
int partition(Long key, int numPartitions) {
// your implementation
}
}, 0)
.sortPartition(0, Order.ASCENDING)
.writeAsText("/my/output");
注意:为了使用自定义分区程序实现平衡分区,您需要了解键的值范围和分布.
Note: In order to achieve balanced partitions with a custom partitioner, you need to know about the value range and distribution of the key.
当前在正在进行的工作中,对Apache Flink中的范围分区程序(具有自动采样)的支持 a>,并且应该很快就可以使用.
Support for a range partitioner (with automatic sampling) in Apache Flink is currently work in progress and should be available soon.
编辑(2016年6月7日):范围分区已添加到1.0.0版的Apache Flink中.您可以按以下方式对数据集进行全局排序:
Edit (June 7th, 2016): Range partitioning was added to Apache Flink with version 1.0.0. You can globally sort a data set as follows:
DataSet<Tuple2<Long, Long>> data = ...
data
.partitionByRange(0)
.sortPartition(0, Order.ASCENDING)
.writeAsText("/my/output");
请注意,范围分区会对输入数据集进行采样,以计算大小相等的分区的数据分布.
Note that range partitioning samples the input data set to compute a data distribution for equally-sized partitions.
这篇关于Apache Flink中的全局排序的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!