Stream data into rotating log tables in BigQuery


Problem Description

I want to stream some time series data into BigQuery with insertAll but only retain the last 3 months (say) to avoid unbounded storage costs. The usual answer is to save each day of data into a separate table but AFAICT this would require each such table to be created in advance. I intend to stream data directly from unsecured clients authorized with a token that only has bigquery.insertdata scope, so they wouldn't be able to create the daily tables themselves. The only solution I can think of would be to run a secure daily cron job to create the tables -- not ideal, especially since if it misfires data will be dropped until the table is created.
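For concreteness, here's a minimal sketch of that daily cron job using the google-cloud-bigquery Python client; the dataset, table, and schema names are all made up:

import datetime

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical schema shared by all the daily tables.
SCHEMA = [
    bigquery.SchemaField("ts", "TIMESTAMP"),
    bigquery.SchemaField("value", "FLOAT"),
]

def create_tomorrows_table():
    # Pre-create tomorrow's table so streaming clients never hit a missing table.
    day = (datetime.date.today() + datetime.timedelta(days=1)).strftime("%Y%m%d")
    table = bigquery.Table("myproject.logs.events" + day, schema=SCHEMA)
    # exists_ok keeps the job idempotent if the cron fires more than once.
    client.create_table(table, exists_ok=True)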

Another approach would be to stream data into a single table and use table decorators to control query costs as the table grows. (I expect all queries to be for specific time ranges so the decorators should be pretty effective here.) However, there's no way to delete old data from the table, so storage costs will become unsustainable after a while. I can't figure out any way to "copy and truncate" the table atomically either, so that I can partition old data into daily tables without losing rows being streamed at that time.
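To illustrate, a hedged sketch of such a decorator query (range decorators take millisecond offsets and only work in legacy SQL; the table name is hypothetical):

from google.cloud import bigquery

client = bigquery.Client()

# @-86400000- limits the scan (and therefore the query cost) to data
# added in roughly the last 24 hours.
query = "SELECT ts, value FROM [myproject:logs.events@-86400000-]"
job_config = bigquery.QueryJobConfig(use_legacy_sql=True)
for row in client.query(query, job_config=job_config):
    print(row["ts"], row["value"])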

Any ideas on how to solve this? Bonus points if your solution lets me re-aggregate old data into temporally coarser rows to retain more history for the same storage cost. Thanks.
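For the "bonus" part, one plausible sketch (all names hypothetical): roll an old daily table up into hourly rows with a destination-table query, then drop the fine-grained table.

from google.cloud import bigquery

client = bigquery.Client()

# Append hourly averages of one old daily table to a coarser archive table.
dest = bigquery.TableReference.from_string("myproject.logs.events_hourly")
job_config = bigquery.QueryJobConfig(
    destination=dest,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
query = """
    SELECT TIMESTAMP_TRUNC(ts, HOUR) AS ts, AVG(value) AS value
    FROM `myproject.logs.events20150601`
    GROUP BY 1
"""
client.query(query, job_config=job_config).result()
# Once the rollup has landed, the fine-grained table can be deleted.
client.delete_table("myproject.logs.events20150601")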

Edit: just realized this is a partial duplicate of Bigquery event streaming and table creation.

Solution

If you look at the streaming API discovery document, there's a curious new experimental field called "templateSuffix", with a very relevant description.
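As a rough sketch of what that looks like in practice, assuming the google-cloud-bigquery Python client (which exposes the field as template_suffix on insert_rows_json); table and field names are made up:

import datetime

from google.cloud import bigquery

client = bigquery.Client()

# Streaming to the base table "events" with a date suffix (e.g. "20150601")
# makes BigQuery create (if necessary) and write to "events20150601",
# copying the base table's schema.
suffix = datetime.date.today().strftime("%Y%m%d")
rows = [{"ts": "2015-06-01 12:00:00", "value": 3.14}]
errors = client.insert_rows_json(
    "myproject.logs.events",  # the base (template) table must already exist
    rows,
    template_suffix=suffix,
)
if errors:
    print("insert errors:", errors)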

I'd also point out that no official documentation has been released, so special care should probably go into using this field, especially in a production setting. Experimental fields can have bugs, etc. Off the top of my head, things to be careful of are:

  • Modifying the schema of the base table in non-backwards-compatible ways.
  • Modifying the schema of a created table directly in a way that is incompatible with the base table.
  • Streaming to a created table directly and via this suffix -- row insert ids might not apply across boundaries.
  • Performing operations on the created table while it's actively being streamed to.

And I'm sure other things. Anyway, just thought I'd point that out. I'm sure official documentation will be much more thorough.
