Streaming write to GCS using Apache Beam per element


Problem Description

The current Beam pipeline reads files as a stream using FileIO.matchAll().continuously(). This returns a PCollection. I want to write these files back, with the same names, to another GCS bucket, i.e. each element of the PCollection is one file (metadata / ReadableFile) that should be written back to another bucket after some processing. Is there any sink I should use to write each PCollection element back to GCS, or is there another way to do it? Is it possible to create a window per element and then use some GCS sink IO to do this? When operating on a window (even if it has multiple elements), does Beam guarantee that a window is either fully processed or not processed at all? In other words, are write operations to GCS or BigQuery for a given window atomic, and not partial, in case of failures?
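For context, a minimal sketch of how such a continuous match can be set up. The bucket path, poll interval, and termination condition below are assumptions, not taken from the original pipeline; it also uses FileIO.match() on a single pattern, while matchAll() works the same way on a PCollection of patterns:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Watch;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class ContinuousMatchSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Continuously watch a GCS prefix for new files and turn each match into a ReadableFile.
    PCollection<FileIO.ReadableFile> files = p
        .apply("MatchContinuously", FileIO.match()
            .filepattern("gs://source-bucket/input/*")       // hypothetical source bucket
            .continuously(Duration.standardSeconds(30),      // hypothetical poll interval
                Watch.Growth.never()))                       // keep watching indefinitely
        .apply("ReadMatches", FileIO.readMatches());

    // ... per-element processing and the copy step from the answer below go here ...

    p.run();
  }
}
```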

Recommended Answer

Can you simply write a DoFn<ReadableFile, Void> that takes the file and copies it to the desired location using the FileSystems API? You don't need any "sink" to do that - and, in any case, this is what all "sinks" (TextIO.write(), AvroIO.write() etc.) are under the hood anyway: they are simply Beam transforms made of ParDo's and GroupByKey's.
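As a concrete illustration of that suggestion, here is a minimal sketch of such a DoFn that copies each matched file into a destination bucket under its original filename using the FileSystems API. The destination prefix is a hypothetical placeholder:

```java
import java.util.Collections;

import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.MoveOptions;
import org.apache.beam.sdk.io.fs.ResolveOptions;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.transforms.DoFn;

/** Copies each matched file to a destination prefix, keeping its original filename. */
public class CopyFileFn extends DoFn<FileIO.ReadableFile, Void> {

  private final String destPrefix;  // e.g. "gs://destination-bucket/output/" (hypothetical)

  public CopyFileFn(String destPrefix) {
    this.destPrefix = destPrefix;
  }

  @ProcessElement
  public void processElement(@Element FileIO.ReadableFile file) throws Exception {
    ResourceId src = file.getMetadata().resourceId();

    // Resolve "<destPrefix>/<original filename>" as the destination resource.
    ResourceId dest = FileSystems.matchNewResource(destPrefix, true /* isDirectory */)
        .resolve(src.getFilename(), ResolveOptions.StandardResolveOptions.RESOLVE_FILE);

    // FileSystems.copy takes parallel lists of source and destination ResourceIds.
    FileSystems.copy(
        Collections.singletonList(src),
        Collections.singletonList(dest),
        MoveOptions.StandardMoveOptions.IGNORE_MISSING_FILES);
  }
}

// Usage in the pipeline, after FileIO.readMatches():
//   files.apply("CopyToDestBucket", ParDo.of(new CopyFileFn("gs://destination-bucket/output/")));
```

If the file contents need to be transformed rather than copied verbatim, the same DoFn could instead read each element with readableFile.readFullyAsBytes() (or open a channel) and write the processed bytes through the WritableByteChannel returned by FileSystems.create(dest, mimeType).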

