Writing Output of a Dataflow Pipeline to a Partitioned Destination


Problem Description

We have a single streaming event source with thousands of events per second, these events are all marked with an id identifying which of our tens of thousands of customers the event belongs to. We'd like to use this event source to populate a data warehouse (in streaming mode), however, our event source is not persistent, so we'd also like to archive the raw data in GCS so we can replay it through our data warehouse pipeline if we make a change that requires it. Because of data retention requirements, any raw data that we persist needs to be partitioned by customer, so that we can easily delete it.

What would the simplest way to solve this in Dataflow be? Currently we're creating a dataflow job with a custom sink that writes the data to files per-customer on GCS/BigQuery, is that sensible?

Recommended Answer

To specify the filename and path, see the TextIO documentation; you provide the filename, path, and so on to the output writer.
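As a minimal sketch of what that write might look like, here is a pipeline using the Apache Beam Java SDK (the open-source successor to the original Dataflow SDK); the Pub/Sub topic, project, and GCS bucket names are placeholders, and the Pub/Sub source is an assumption standing in for whatever streaming event source the question describes:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class ArchiveRawEvents {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Hypothetical streaming source; substitute the job's real unbounded source.
    PCollection<String> rawEvents =
        p.apply("ReadEvents",
            PubsubIO.readStrings().fromTopic("projects/my-project/topics/raw-events"));

    // The filename prefix and suffix are handed to the output writer. Because the
    // collection is unbounded, the write is windowed so files are finalized per window.
    rawEvents
        .apply("Window", Window.into(FixedWindows.of(Duration.standardMinutes(5))))
        .apply("ArchiveRaw",
            TextIO.write()
                .to("gs://my-archive-bucket/raw/part")
                .withSuffix(".json")
                .withWindowedWrites()
                .withNumShards(10));

    p.run();
  }
}
```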

For your use case of multiple output files, you can use the Partition function to create multiple PCollections out of a single source PCollection.
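A sketch of that approach, again using the Beam Java SDK: note that Partition needs the number of outputs at graph-construction time, so with tens of thousands of customers this hashes customer ids into a fixed number of buckets rather than creating one output per customer. The event format and the `extractCustomerId` helper are hypothetical.

```java
import org.apache.beam.sdk.transforms.Partition;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

public class PartitionByCustomer {

  // Split a single PCollection of raw events into a fixed number of PCollections,
  // assigning each event to a bucket based on its customer id.
  public static PCollectionList<String> splitByCustomer(
      PCollection<String> events, int numBuckets) {
    return events.apply("PartitionByCustomer",
        Partition.of(numBuckets,
            (Partition.PartitionFn<String>)
                (event, n) -> Math.floorMod(extractCustomerId(event).hashCode(), n)));
  }

  // Hypothetical helper: assumes each raw event is a "customerId,payload" line.
  private static String extractCustomerId(String event) {
    return event.split(",", 2)[0];
  }
}
```

Each PCollection in the returned PCollectionList can then be given its own write transform (for example, a TextIO.write() with a bucket-specific GCS path), so the archived files can be deleted per partition when retention requires it.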
