BigQuery writeTableRows Always Writing to Buffer

Problem Description

We are trying to write to BigQuery using Apache Beam and Avro.

The following seems to work fine:

p.apply("Input", AvroIO.read(DataStructure.class).from("AvroSampleFile.avro"))
            .apply("Transform", ParDo.of(new CustomTransformFunction()))
            .apply("Load", BigQueryIO.writeTableRows().to(table).withSchema(schema));

We then tried to use it in the following manner to get data from Google Pub/Sub:

p.begin()
            .apply("Input", PubsubIO.readAvros(DataStructure.class).fromTopic("topicName"))
            .apply("Transform", ParDo.of(new CustomTransformFunction()))
            .apply("Write", BigQueryIO.writeTableRows()
                    .to(table)
                    .withSchema(schema)
                    .withTimePartitioning(timePartitioning)
                    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
        p.run().waitUntilFinish();
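
As an aside (not shown in the question), a Pub/Sub source makes the pipeline unbounded, so it is normally launched in streaming mode. A minimal sketch, assuming the Dataflow runner and illustrative option handling:

    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    // Sketch: build the pipeline in streaming mode for the unbounded
    // Pub/Sub source (assumes the Dataflow runner).
    DataflowPipelineOptions options = PipelineOptionsFactory
            .fromArgs(args)
            .withValidation()
            .as(DataflowPipelineOptions.class);
    options.setStreaming(true);
    Pipeline p = Pipeline.create(options);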

When we do this it always pushes the rows into the streaming buffer, and BigQuery seems to take a long time to read from the buffer. Can anyone tell me why the above won't write the records directly to the BigQuery tables?
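
For context (an explanation not in the original post): with an unbounded input such as Pub/Sub, BigQueryIO defaults to streaming inserts, so the pipeline above is effectively doing the following, which is why rows surface in the streaming buffer first:

    // Sketch of the default the pipeline above falls back to: for an
    // unbounded source, BigQueryIO uses the streaming-insert API, and
    // streamed rows sit in BigQuery's streaming buffer before they are
    // committed to managed storage.
    BigQueryIO.writeTableRows()
            .to(table)
            .withSchema(schema)
            .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS);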

UPDATE: It looks like I need to add the following settings, but this throws a java.lang.IllegalArgumentException.

.withMethod(Method.FILE_LOADS)
.withTriggeringFrequency(org.joda.time.Duration.standardMinutes(2))

Answer

The answer is that you need to include withNumFileShards (which can be 1 to 1000), like so. With FILE_LOADS and a triggering frequency on an unbounded source, Beam requires the number of file shards to be set explicitly, and leaving it unset is what throws the IllegalArgumentException above.

        p.begin()
            .apply("Input", PubsubIO.readAvros(DataStructure.class).fromTopic("topicName"))
            .apply("Transform", ParDo.of(new CustomTransformFunction()))
            .apply("Write", BigQueryIO.writeTableRows()
                    .to(table)
                    .withSchema(schema)
                    .withTimePartitioning(timePartitioning)
                    .withMethod(Method.FILE_LOADS)  // batch load jobs instead of streaming inserts
                    .withTriggeringFrequency(org.joda.time.Duration.standardMinutes(2))
                    .withNumFileShards(1000)  // required with FILE_LOADS on an unbounded source
                    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
        p.run().waitUntilFinish();
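
The table, schema, and timePartitioning values are not shown in the answer; a minimal sketch of how they might be constructed (project, dataset, and field names here are all hypothetical):

    import com.google.api.services.bigquery.model.TableFieldSchema;
    import com.google.api.services.bigquery.model.TableReference;
    import com.google.api.services.bigquery.model.TableSchema;
    import com.google.api.services.bigquery.model.TimePartitioning;
    import java.util.Arrays;

    // Hypothetical destination table.
    TableReference table = new TableReference()
            .setProjectId("my-project")
            .setDatasetId("my_dataset")
            .setTableId("my_table");

    // Hypothetical schema matching the TableRows produced by the transform.
    TableSchema schema = new TableSchema().setFields(Arrays.asList(
            new TableFieldSchema().setName("id").setType("STRING"),
            new TableFieldSchema().setName("eventTime").setType("TIMESTAMP")));

    // Daily partitioning on the hypothetical eventTime column.
    TimePartitioning timePartitioning = new TimePartitioning()
            .setType("DAY")
            .setField("eventTime");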

I can't find this documented anywhere to say that withNumFileShards is mandatory, but there is a Jira ticket for it which I found after applying the fix:

https://issues.apache.org/jira/browse/BEAM-3198
