Creating/Writing to Partitioned BigQuery table via Google Cloud Dataflow

Problem description

I want to take advantage of the new BigQuery time-partitioned tables feature, but I am unsure whether this is currently possible in version 1.6 of the Dataflow SDK.

Looking at the BigQuery JSON API, to create a day partitioned table one needs to pass in a

"timePartitioning": { "type": "DAY" }

option, but the com.google.cloud.dataflow.sdk.io.BigQueryIO interface only allows specifying a TableReference.
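For reference, the request body that a tables.insert call would need can be sketched in plain Java as below. This is only an illustration of the JSON shape, not Dataflow SDK code: the class name DayPartitionedTableBody and the project, dataset, and table identifiers are hypothetical, and the sketch assumes the identifiers contain no characters that need JSON escaping.

```java
public class DayPartitionedTableBody {
    /**
     * Builds a JSON body for a BigQuery tables.insert request that asks for
     * a day-partitioned table. Assumes the identifiers need no JSON escaping.
     */
    static String build(String projectId, String datasetId, String tableId) {
        return String.format(
            "{\"tableReference\":{\"projectId\":\"%s\",\"datasetId\":\"%s\",\"tableId\":\"%s\"},"
                + "\"timePartitioning\":{\"type\":\"DAY\"}}",
            projectId, datasetId, tableId);
    }

    public static void main(String[] args) {
        // Hypothetical identifiers, used only for illustration.
        System.out.println(build("my-project", "my_dataset", "my_table"));
    }
}
```

A real schema would normally be added to the same body; it is left out here to keep the focus on the timePartitioning field.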

I thought that maybe I could pre-create the table and sneak in a partition decorator via a BigQueryIO.Write.toTableReference lambda. Has anyone else had success creating or writing to partitioned tables via Dataflow?

This seems similar to the issue of setting the table expiration time, which isn't currently available either.

Recommended answer

As Pavan says, it is definitely possible to write to partitioned tables with Dataflow. Are you using the DataflowPipelineRunner in streaming mode or batch mode?

The solution you proposed should work. Specifically, if you pre-create a table with date partitioning set up, then you can use a BigQueryIO.Write.toTableReference lambda to write to a date partition. For example:

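import org.joda.time.DateTimeZone;
import org.joda.time.Instant;
import org.joda.time.format.DateTimeFormat;
import org.joda.time.format.DateTimeFormatter;

/**
 * A Joda-time formatter that prints a date in format like {@code "20160101"}.
 * Threadsafe.
 */
private static final DateTimeFormatter FORMATTER =
    DateTimeFormat.forPattern("yyyyMMdd").withZone(DateTimeZone.UTC);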

// This code generates a valid BigQuery partition name:
Instant instant = Instant.now(); // any Joda instant in a reasonable time range
String baseTableName = "project:dataset.table"; // a valid BigQuery table name
String partitionName =
    String.format("%s$%s", baseTableName, FORMATTER.print(instant));
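To try the naming scheme outside a pipeline, here is a minimal standalone sketch of the same idea. Note that it swaps Joda-Time for java.time so it runs without extra dependencies; the class name PartitionDecorator and the helper partitionName are hypothetical and are not part of the Dataflow SDK.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class PartitionDecorator {
    // java.time equivalent of the Joda formatter above; prints dates like "20160101" in UTC.
    private static final DateTimeFormatter FORMATTER =
        DateTimeFormatter.ofPattern("yyyyMMdd").withZone(ZoneOffset.UTC);

    /** Appends a "$yyyyMMdd" partition decorator (UTC day of the instant) to the table name. */
    static String partitionName(String baseTableName, Instant instant) {
        return baseTableName + "$" + FORMATTER.format(instant);
    }

    public static void main(String[] args) {
        Instant instant = Instant.parse("2016-01-01T23:59:59Z");
        // Prints: project:dataset.table$20160101
        System.out.println(partitionName("project:dataset.table", instant));
    }
}
```

The day is computed in UTC, so an element stamped just before midnight UTC lands in the earlier partition regardless of the local time zone of the worker.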
