Apache Flink中的全局排序 [英] Global sorting in Apache Flink

查看:1417
本文介绍了Apache Flink中的全局排序的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

sortPartition方法基于一些指定的字段在本地对数据集进行排序.如何在Flink中高效地对大型数据集进行全局排序?

sortPartition method of a dataset sorts the dataset locally based on some specified fields. How can I get my large Dataset sorted globally in an efficient way in Flink?

推荐答案

由于Flink尚不提供内置的范围分区策略,因此目前尚不容易实现.

This is currently not easily possible because Flink does not provide a built-in range partitioning strategy, yet.

一种解决方法是实施自定义Partitioner:

A work-around is to implement a custom Partitioner:

DataSet<Tuple2<Long, Long>> data = ...
data
  .partitionCustom(new Partitioner<Long>() {
    int partition(Long key, int numPartitions) {
      // your implementation
    }
  }, 0)
  .sortPartition(0, Order.ASCENDING)
  .writeAsText("/my/output");

注意:为了使用自定义分区程序实现平衡分区,您需要了解键的值范围和分布.

Note: In order to achieve balanced partitions with a custom partitioner, you need to know about the value range and distribution of the key.

当前在正在进行的工作中,对Apache Flink中的范围分区程序(具有自动采样)的支持 a>,并且应该很快就可以使用.

Support for a range partitioner (with automatic sampling) in Apache Flink is currently work in progress and should be available soon.

编辑(2016年6月7日):范围分区已添加到1.0.0版的Apache Flink中.您可以按以下方式对数据集进行全局排序:

Edit (June 7th, 2016): Range partitioning was added to Apache Flink with version 1.0.0. You can globally sort a data set as follows:

DataSet<Tuple2<Long, Long>> data = ...
data
  .partitionByRange(0)
  .sortPartition(0, Order.ASCENDING)
  .writeAsText("/my/output");

请注意,范围分区会对输入数据集进行采样,以计算大小相等的分区的数据分布.

Note that range partitioning samples the input data set to compute a data distribution for equally-sized partitions.

这篇关于Apache Flink中的全局排序的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆