Output sorted text file from Google Cloud Dataflow
Question
I have a PCollection<String> in Google Cloud Dataflow and I'm outputting it to text files via TextIO.Write.to:
PCollection<String> lines = ...;
lines.apply(TextIO.Write.to("gs://bucket/output.txt"));
Currently the lines within each output shard are in random order.
Is it possible to get Dataflow to output the lines in sorted order?
Recommended answer
This is not directly supported by Dataflow.
For a bounded PCollection, if you shard your input finely enough, you can write sorted files with a Sink implementation that sorts each shard. You may want to refer to the TextSink implementation for a basic outline.
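As a rough illustration of the "sort each shard" idea, here is a minimal plain-Java sketch of what the per-shard writer inside such a Sink might do: buffer the shard's lines in memory, then sort and flush them on close. The class SortedShardWriter and its methods are hypothetical names made up for this example; they are not part of the Dataflow SDK, and the wiring into an actual Sink is left out.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Hypothetical helper (NOT a Dataflow SDK class): buffers the lines of one
 * shard in memory and writes them in sorted order when closed.
 */
class SortedShardWriter implements AutoCloseable {
    private final Writer out;
    private final List<String> buffer = new ArrayList<>();

    SortedShardWriter(Writer out) {
        this.out = out;
    }

    /** Buffers a line; nothing reaches the underlying writer until close(). */
    void write(String line) {
        buffer.add(line);
    }

    /** Sorts the buffered lines in natural String order and writes one per line. */
    @Override
    public void close() throws IOException {
        Collections.sort(buffer);
        for (String line : buffer) {
            out.write(line);
            out.write('\n');
        }
        out.flush();
        out.close();
    }

    /** Convenience method for trying the idea on an in-memory shard. */
    static String sortShardToString(List<String> lines) {
        StringWriter sink = new StringWriter();
        try (SortedShardWriter writer = new SortedShardWriter(sink)) {
            for (String line : lines) {
                writer.write(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toString();
    }
}
```

Because every line of a shard is held in memory until close(), this only works when shards are small, which is why the input needs to be sharded finely enough. It also orders lines only within a shard; the output files are not sorted relative to one another.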