Writing Output of a Dataflow Pipeline to a Partitioned Destination


Problem Description

We have a single streaming event source with thousands of events per second, these events are all marked with an id identifying which of our tens of thousands of customers the event belongs to. We'd like to use this event source to populate a data warehouse (in streaming mode), however, our event source is not persistent, so we'd also like to archive the raw data in GCS so we can replay it through our data warehouse pipeline if we make a change that requires it. Because of data retention requirements, any raw data that we persist needs to be partitioned by customer, so that we can easily delete it.

What would the simplest way to solve this in Dataflow be? Currently we're creating a dataflow job with a custom sink that writes the data to files per-customer on GCS/BigQuery, is that sensible?

Recommended Answer

To specify the filename and path, see the TextIO documentation; you provide the filename, path, and so on to the output writer.
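As a minimal sketch of what that write might look like, here is a pipeline using the Apache Beam Java SDK (the open-source successor to the original Dataflow SDK); the Pub/Sub topic, project, and GCS bucket names are placeholders, and the Pub/Sub source is an assumption standing in for whatever streaming event source the question describes:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class ArchiveRawEvents {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Hypothetical streaming source; substitute the job's real unbounded source.
    PCollection<String> rawEvents =
        p.apply("ReadEvents",
            PubsubIO.readStrings().fromTopic("projects/my-project/topics/raw-events"));

    // The filename prefix and suffix are handed to the output writer. Because the
    // collection is unbounded, the write is windowed so files are finalized per window.
    rawEvents
        .apply("Window", Window.into(FixedWindows.of(Duration.standardMinutes(5))))
        .apply("ArchiveRaw",
            TextIO.write()
                .to("gs://my-archive-bucket/raw/part")
                .withSuffix(".json")
                .withWindowedWrites()
                .withNumShards(10));

    p.run();
  }
}
```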

For your use case of multiple output files, you can use the Partition function to create multiple PCollections out of a single source PCollection.
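A sketch of that approach, again using the Beam Java SDK: note that Partition needs the number of outputs at graph-construction time, so with tens of thousands of customers this hashes customer ids into a fixed number of buckets rather than creating one output per customer. The event format and the `extractCustomerId` helper are hypothetical.

```java
import org.apache.beam.sdk.transforms.Partition;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

public class PartitionByCustomer {

  // Split a single PCollection of raw events into a fixed number of PCollections,
  // assigning each event to a bucket based on its customer id.
  public static PCollectionList<String> splitByCustomer(
      PCollection<String> events, int numBuckets) {
    return events.apply("PartitionByCustomer",
        Partition.of(numBuckets,
            (Partition.PartitionFn<String>)
                (event, n) -> Math.floorMod(extractCustomerId(event).hashCode(), n)));
  }

  // Hypothetical helper: assumes each raw event is a "customerId,payload" line.
  private static String extractCustomerId(String event) {
    return event.split(",", 2)[0];
  }
}
```

Each PCollection in the returned PCollectionList can then be given its own write transform (for example, a TextIO.write() with a bucket-specific GCS path), so the archived files can be deleted per partition when retention requires it.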
