Dynamic BigQuery table names in Dataflow


Problem Description



Basically, we want to split a big (billions of rows) BigQuery table into a large number (around 100k) of smaller tables based on the value of a particular column (not date). I can't figure out how to do this efficiently in BigQuery itself, so I am thinking of using Dataflow.

With Dataflow, we can first load the data, then create a key-value pair for each record, where the key is that record's value in the particular column we want to split the table on; we can then group the records by key. After this operation we have a PCollection of (key, [records]) pairs. We would then need to write each group back to a BigQuery table, where the table name could be key_table.

So the operation would be: p | beam.io.Read(beam.io.BigQuerySource()) | beam.Map(lambda record: (record['splitcol'], record)) | beam.GroupByKey() | beam.io.Write(beam.io.BigQuerySink(...))
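Written out, that sketch might look like the following (a hedged sketch against the Beam Python SDK of the time; the table names are placeholders, and the final write is only schematic, since after GroupByKey the elements are (key, records) pairs rather than rows):

```python
# A sketch of the pipeline above; 'mydataset.bigtable' and the sink table
# are placeholder names, not from the original question.
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | 'Read' >> beam.io.Read(beam.io.BigQuerySource(table='mydataset.bigtable'))
     | 'KeyBySplitCol' >> beam.Map(lambda record: (record['splitcol'], record))
     | 'Group' >> beam.GroupByKey()
     # Schematic only: a single BigQuerySink targets one fixed table,
     # which is exactly the limitation this question is about.
     | 'Write' >> beam.io.Write(beam.io.BigQuerySink('mydataset.output_table')))
```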

The key question now is: in the last step, how do I write to a different table based on the value in each element of the PCollection?

This question is somewhat related to another question: Writing different values to different BigQuery tables in Apache Beam. But I am a Python person, and I'm not sure whether the same solution is possible in the Python SDK as well.

Solution

Currently this feature (value-dependent BigQueryIO.write()) is only supported in Beam Java. Unfortunately I can't think of an easy way to mimic it using Beam Python, short of reimplementing the respective Java code. Please feel free to open a JIRA feature request.

I guess the simplest thing that comes to mind is writing a DoFn to manually write your rows to the respective tables, using the BigQuery streaming insert API (rather than the Beam BigQuery connector); however, keep in mind that streaming inserts are more expensive and subject to stricter quota policies than bulk imports (which are what the Java BigQuery connector uses when writing a bounded PCollection).
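A minimal sketch of that DoFn approach might look like this. The dataset name mydataset, the split column splitcol, and the <key>_table naming are placeholders taken from the question, and the target tables are assumed to already exist:

```python
# A sketch of the DoFn approach using the google-cloud-bigquery client's
# streaming inserts instead of the Beam BigQuery connector. Production
# code would also need table creation, batching, and retry handling.
import apache_beam as beam
from google.cloud import bigquery


class StreamToKeyedTable(beam.DoFn):
    """Streams each (key, records) group into the table '<key>_table'."""

    def start_bundle(self):
        # Create the client per bundle; it cannot be pickled along with
        # the DoFn when the pipeline is shipped to workers.
        self.client = bigquery.Client()

    def process(self, element):
        key, records = element
        table_id = 'mydataset.%s_table' % key  # placeholder dataset name
        # insert_rows_json uses the BigQuery streaming insert API; rows
        # must be JSON-serializable dicts, as yielded by BigQuerySource.
        errors = self.client.insert_rows_json(table_id, list(records))
        if errors:
            raise RuntimeError('Streaming insert failed: %s' % errors)


with beam.Pipeline() as p:
    (p
     | beam.io.Read(beam.io.BigQuerySource(table='mydataset.bigtable'))
     | beam.Map(lambda record: (record['splitcol'], record))
     | beam.GroupByKey()
     | beam.ParDo(StreamToKeyedTable()))
```

Note that with groups this large, the rows would have to be inserted in chunks, since the streaming API limits the size of each insert request.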

There is also work happening in Beam on allowing reuse of transforms across languages - a design is being discussed at https://s.apache.org/beam-mixed-language-pipelines. When that work is completed, you would be able to use the Java BigQuery connector from a Python pipeline.
