Output sorted text file from Google Cloud Dataflow


Question

I have a PCollection<String> in Google Cloud Dataflow and I'm writing it to text files via TextIO.Write.to:

PCollection<String> lines = ...;
lines.apply(TextIO.Write.to("gs://bucket/output.txt"));

Currently the lines of each shard of output are in random order.

Is it possible to get Dataflow to output the lines in sorted order?

Recommended answer

This is not directly supported by Dataflow.

For a bounded PCollection, if you shard your input finely enough, then you can write sorted files with a Sink implementation that sorts each shard. You may want to refer to the TextSink implementation for a basic outline.
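To illustrate the "sort each shard" idea, here is a minimal sketch in plain Java. It is not Dataflow SDK code: SortingShardWriter is a hypothetical helper that only shows the buffering logic a custom Sink's writer might use, and it assumes each shard is small enough to fit in memory.

```java
import java.io.IOException;
import java.io.Writer;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Hypothetical helper (not part of the Dataflow SDK): buffers the lines
 * of one shard in memory and emits them in sorted order on close().
 * Assumes the shard is small enough to hold in memory.
 */
final class SortingShardWriter implements AutoCloseable {
    private final Writer out;
    private final List<String> buffer = new ArrayList<>();

    SortingShardWriter(Writer out) {
        this.out = out;
    }

    /** Buffers a line; nothing is written until close(). */
    void write(String line) {
        buffer.add(line);
    }

    /** Sorts the buffered lines (natural String order) and writes them out. */
    @Override
    public void close() throws IOException {
        Collections.sort(buffer);
        for (String line : buffer) {
            out.write(line);
            out.write('\n');
        }
        // Flushes but does not close the underlying writer; the caller owns it.
        out.flush();
        buffer.clear();
    }
}
```

Note that this only orders the lines within a single shard. If the files also need to be ordered relative to each other, the input would additionally have to be partitioned by key range before writing.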
