After recreating BigQuery table streaming inserts are not working?


Problem Description



I just came across an interesting issue with BigQuery.

Essentially, there is a batch job that recreates a table in BigQuery - to delete the data - and then immediately starts to feed in a new set through the streaming interface.

It used to work like this for quite a while - successfully.

Lately it has started to lose data.

A small test case has confirmed the situation - if the data feed starts immediately after recreating (successfully!) the table, parts of the dataset will be lost. That is, out of 4000 records being fed in, only 2100 - 3500 would make it through.

I suspect that table creation might be returning success before the table operations (deletion and creation) have fully propagated throughout the environment, so the first parts of the dataset are being fed to old replicas of the table (speculating here).

To confirm this, I put a delay between creating the table and starting the data feed. Indeed, if the delay is less than 120 seconds, parts of the dataset are lost.

If it is more than 120 seconds, it seems to work OK.
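
For reference, the test case follows roughly the pattern sketched below. This is a minimal illustration, assuming the google-cloud-bigquery Python client; the project, dataset, table and schema names are placeholders rather than the ones from the actual job:

    import time
    from google.cloud import bigquery

    client = bigquery.Client()
    table_id = "my-project.my_dataset.my_table"  # placeholder name

    # Recreate the table to delete the old data: drop it, then create it again.
    client.delete_table(table_id, not_found_ok=True)
    schema = [
        bigquery.SchemaField("id", "INTEGER"),
        bigquery.SchemaField("payload", "STRING"),
    ]
    client.create_table(bigquery.Table(table_id, schema=schema))

    # Without this pause, part of the streamed dataset is silently dropped;
    # with roughly 120 seconds or more it appears to arrive intact.
    time.sleep(120)

    # Feed the new set in through the streaming interface.
    rows = [{"id": i, "payload": "row %d" % i} for i in range(4000)]
    for start in range(0, len(rows), 500):  # stream in batches of 500 rows
        errors = client.insert_rows_json(table_id, rows[start:start + 500])
        if errors:
            raise RuntimeError(errors)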

There used to be no need for this delay. We are using US BigQuery. Am I missing something obvious here?

EDIT: Based on the comment provided by Sean Chen below and a few other sources, the behaviour is expected due to the way tables are cached and the internal table id is propagated throughout the system. BigQuery has been built for append-only types of operations. Re-writes are not something that can easily be accommodated in the design, and they should be avoided.

Solution

This is more or less expected due to the way that BigQuery streaming servers cache the table generation id (an internal name for the table).

Can you provide more information about the use case? It seems strange to delete the table and then write to the same table again.

One workaround could be to truncate the table instead of deleting it. You can do this by running SELECT * FROM <table> LIMIT 0 with the table itself as the destination table (you might want to use allow_large_results = true and disable flattening, which will help if you have nested data) and with write_disposition=WRITE_TRUNCATE. This will empty out the table but preserve the schema. Any rows streamed afterwards will then be applied to the same table.
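
As a rough illustration of that workaround (not the exact steps from the answer), the sketch below assumes the google-cloud-bigquery Python client and placeholder project, dataset and table names; because the SELECT above is legacy SQL, it sets use_legacy_sql=True along with allow_large_results and flatten_results:

    from google.cloud import bigquery

    client = bigquery.Client()
    # Placeholder table name.
    table_ref = bigquery.TableReference.from_string("my-project.my_dataset.my_table")

    # Truncate instead of delete: write a zero-row result back into the same
    # table with WRITE_TRUNCATE. This clears the data but preserves the table
    # and its schema, so rows streamed afterwards go to the same table.
    job_config = bigquery.QueryJobConfig(
        destination=table_ref,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
        use_legacy_sql=True,
        allow_large_results=True,  # needed for legacy SQL with a destination table
        flatten_results=False,     # keep nested/repeated fields intact
    )
    client.query(
        "SELECT * FROM [my_dataset.my_table] LIMIT 0", job_config=job_config
    ).result()

    # Rows streamed afterwards land in the same, now-empty table.
    client.insert_rows_json(table_ref, [{"id": 1, "payload": "first row"}])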
