Streaming write to GCS using Apache Beam per element

This article describes how to do a streaming write to GCS per element using Apache Beam; it may serve as a useful reference for anyone facing the same problem.

Problem Description

The current Beam pipeline is reading files as a stream using FileIO.matchAll().continuously(). This returns a PCollection. I want to write these files back with the same names to another GCS bucket, i.e. each PCollection element is one file metadata/ReadableFile which should be written back to another bucket after some processing. Is there any sink that I should use to achieve writing each PCollection item back to GCS, or is there any other way to do it? Is it possible to create a window per element and then use some GCS sink IO to do this? When operating on a window (even if it has multiple elements), does Beam guarantee that a window is either fully processed or not processed at all; in other words, are write operations to GCS or BigQuery for a given window atomic rather than partial in case of failures?

Recommended Answer

Can you simply write a DoFn&lt;ReadableFile, Void&gt; that takes the file and copies it to the desired location using the FileSystems API? You don't need any "sink" to do that - and, in any case, this is what all "sinks" (TextIO.write(), AvroIO.write(), etc.) are under the hood anyway: they are simply Beam transforms made of ParDos and GroupByKeys.
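A minimal sketch of that approach is below. The destination prefix (gs://dest-bucket/) and the class name CopyFileFn are placeholders, and the body does a plain byte-for-byte copy; any per-file processing would go between reading and writing.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;

import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.ResolveOptions.StandardResolveOptions;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.util.MimeTypes;

/** Copies each matched file to a destination prefix, keeping the original file name. */
public class CopyFileFn extends DoFn<FileIO.ReadableFile, Void> {

  // Hypothetical destination prefix, e.g. "gs://dest-bucket/"; replace with your own.
  private final String destPrefix;

  public CopyFileFn(String destPrefix) {
    this.destPrefix = destPrefix;
  }

  @ProcessElement
  public void processElement(@Element FileIO.ReadableFile file) throws IOException {
    ResourceId source = file.getMetadata().resourceId();

    // Build gs://dest-bucket/<original-file-name> for the output object.
    ResourceId dest =
        FileSystems.matchNewResource(destPrefix, true /* isDirectory */)
            .resolve(source.getFilename(), StandardResolveOptions.RESOLVE_FILE);

    // Stream the bytes from the source object into the new destination object.
    try (ReadableByteChannel in = file.open();
        WritableByteChannel out = FileSystems.create(dest, MimeTypes.BINARY)) {
      ByteBuffer buffer = ByteBuffer.allocate(64 * 1024);
      while (in.read(buffer) != -1) {
        buffer.flip();
        while (buffer.hasRemaining()) {
          out.write(buffer);
        }
        buffer.clear();
      }
    }
  }
}
```

It could then be wired in after the match step, e.g. FileIO.readMatches() followed by ParDo.of(new CopyFileFn("gs://dest-bucket/")). For a straight copy with no processing at all, FileSystems.copy(...) on the source and destination ResourceIds would be another option.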

This concludes the article on streaming writes to GCS using Apache Beam per element; hopefully the recommended answer is helpful.
