Streaming write to GCS using Apache Beam per element
Question
The current Beam pipeline is reading files as a stream using FileIO.matchAll().continuously(). This returns a PCollection. I want to write these files back with the same names to another GCS bucket, i.e. each PCollection element is one file metadata/readableFile which should be written back to another bucket after some processing. Is there any sink I should use to write each PCollection item back to GCS, or is there some other way to do it?

Is it possible to create a window per element and then use some GCS sink IO to do this?

When operating on a window (even if it has multiple elements), does Beam guarantee that a window is either fully processed or not processed at all? In other words, are write operations to GCS or BigQuery for a given window atomic, never partial, in case of a failure?
Answer
Can you simply write a DoFn<ReadableFile, Void> that takes the file and copies it to the desired location using the FileSystems API? You don't need any "sink" to do that - and, in any case, this is what all "sinks" (TextIO.write(), AvroIO.write() etc.) are under the hood anyway: they are simply Beam transforms made of ParDo's and GroupByKey's.
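A minimal sketch of that DoFn, assuming the Beam Java SDK and a hypothetical destination bucket (`gs://my-dest-bucket/` is a placeholder, not from the original question). It keeps the source file name and streams the bytes through the FileSystems API; any per-file processing would go between reading and writing:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;

import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.util.MimeTypes;

/** Copies each matched file into another bucket under the same name. */
class CopyToBucketFn extends DoFn<FileIO.ReadableFile, Void> {
  // Assumption: replace with your actual destination bucket.
  private static final String DEST_PREFIX = "gs://my-dest-bucket/";

  @ProcessElement
  public void processElement(@Element FileIO.ReadableFile file) throws IOException {
    // Keep the original file name, but point the resource at the destination bucket.
    String fileName = file.getMetadata().resourceId().getFilename();
    ResourceId dest =
        FileSystems.matchNewResource(DEST_PREFIX + fileName, false /* isDirectory */);

    // Stream bytes from the source channel to the destination channel.
    try (ReadableByteChannel in = file.open();
        WritableByteChannel out = FileSystems.create(dest, MimeTypes.BINARY)) {
      ByteBuffer buffer = ByteBuffer.allocate(64 * 1024);
      while (in.read(buffer) != -1) {
        buffer.flip();
        while (buffer.hasRemaining()) {
          out.write(buffer);
        }
        buffer.clear();
      }
    }
  }
}
```

The pipeline would apply this with something like `readableFiles.apply(ParDo.of(new CopyToBucketFn()))`. If no per-element transformation of the bytes is needed, `FileSystems.copy(srcResourceIds, destResourceIds)` is an even shorter alternative.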