Apache Flink 中的全局排序 [英] Global sorting in Apache Flink

查看:50
本文介绍了Apache Flink 中的全局排序的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

数据集的sortPartition方法根据一些指定的字段在本地对数据集进行排序.如何在 Flink 中以有效的方式对我的大型数据集进行全局排序?

sortPartition method of a dataset sorts the dataset locally based on some specified fields. How can I get my large Dataset sorted globally in an efficient way in Flink?

推荐答案

这目前不太容易实现,因为 Flink 还没有提供内置的范围分区策略.

This is currently not easily possible because Flink does not provide a built-in range partitioning strategy, yet.

解决方法是实现自定义Partitioner:

A work-around is to implement a custom Partitioner:

DataSet<Tuple2<Long, Long>> data = ...
data
  .partitionCustom(new Partitioner<Long>() {
    int partition(Long key, int numPartitions) {
      // your implementation
    }
  }, 0)
  .sortPartition(0, Order.ASCENDING)
  .writeAsText("/my/output");

注意:为了使用自定义分区器实现平衡分区,您需要了解键的值范围和分布.

Note: In order to achieve balanced partitions with a custom partitioner, you need to know about the value range and distribution of the key.

Apache Flink 中对范围分区器(具有自动采样)的支持目前正在正在进行中 并且应该很快可用.

Support for a range partitioner (with automatic sampling) in Apache Flink is currently work in progress and should be available soon.

编辑(2016 年 6 月 7 日):范围分区已添加到版本 1.0.0 的 Apache Flink.您可以按如下方式对数据集进行全局排序:

Edit (June 7th, 2016): Range partitioning was added to Apache Flink with version 1.0.0. You can globally sort a data set as follows:

DataSet<Tuple2<Long, Long>> data = ...
data
  .partitionByRange(0)
  .sortPartition(0, Order.ASCENDING)
  .writeAsText("/my/output");

请注意,范围分区对输入数据集进行采样,以计算大小相等的分区的数据分布.

Note that range partitioning samples the input data set to compute a data distribution for equally-sized partitions.

这篇关于Apache Flink 中的全局排序的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆