Is there any other approach for updating a row in BigQuery apart from overwriting the table?


Question

I have package data with fields such as the following:

packageid --> string
status --> string
status_type --> string
scans --> record (repeated)
     scanid --> string
     status --> string
scannedby --> string

Each day I receive data for 100,000 packages. The daily package data comes to roughly 100 MB, or about 3 GB per month. Each package can receive 3-4 updates. Do I have to overwrite the package table every time a package update arrives (e.g. just a change in the status field)?

Suppose the table holds data for 3 packages and an update for the 2nd package arrives. Do I have to overwrite the whole table (deleting and re-adding all the data takes 2 transactions per package update)? For 100,000 packages, total transactions will be 10^5 * 10^5 * 2/2.

Is there any other approach that allows atomic updates without overwriting the table? (If the table contains 1 million entries and a package update arrives, overwriting the whole table is a lot of overhead.)

Solution

Currently there is no way to update individual rows. We do see this use case somewhat often, and we recommend something similar to what Mikhail suggested. Basically, if you have some unique ID for a logical row, and a timestamp of the update time to the row data, you can simply add every update as a new row, and apply a view over the table to give you the desired rows.

Your view would look something like this:

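SELECT *
FROM (
  SELECT
      *,
      -- Newest update time seen for each logical row.
      MAX(<timestamp_column>)
          OVER (PARTITION BY <id_column>)
          AS max_timestamp
  FROM <table>
)
-- Keep only the row(s) carrying that newest timestamp.
WHERE <timestamp_column> = max_timestamp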

(Cribbed from the question "Return only the newest rows from a BigQuery table with a duplicate items".)
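
To make the append-and-filter idea concrete, here is a minimal Python sketch of what the view computes. It is an illustration only and does not call BigQuery; the PackageRow type, the updated_at field, and the helper names are assumptions invented for this example.

```python
from typing import Dict, List, NamedTuple


class PackageRow(NamedTuple):
    """One appended row; updated_at is a hypothetical timestamp column."""
    packageid: str
    status: str
    updated_at: int


def append_update(log: List[PackageRow], row: PackageRow) -> None:
    # Updates are never applied in place: each one becomes a new row.
    log.append(row)


def latest_rows(log: List[PackageRow]) -> Dict[str, PackageRow]:
    # Mirrors the view: keep the row with the maximum timestamp per packageid.
    # Unlike the SQL view, ties on updated_at keep only the first row seen.
    latest: Dict[str, PackageRow] = {}
    for row in log:
        current = latest.get(row.packageid)
        if current is None or row.updated_at > current.updated_at:
            latest[row.packageid] = row
    return latest


log: List[PackageRow] = []
append_update(log, PackageRow("pkg-1", "created", 1))
append_update(log, PackageRow("pkg-2", "created", 2))
append_update(log, PackageRow("pkg-2", "in_transit", 3))

current_state = latest_rows(log)
print(current_state["pkg-2"].status)
```

The raw log keeps all three rows, while the derived state holds one row per package, which is the trade-off the view makes: cheap appends in exchange for a filter at read time.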

If your table is partitioned into daily tables (or becomes static after some period), you can then replace the view with the result of the view query after the table stabilizes, and improve your query efficiency.

e.g.

  • Add data to TABLE_RAW.
  • Create view TABLE that performs the above query over TABLE_RAW.
  • At some point after TABLE_RAW is stable, query TABLE with a destination table of TABLE, with write disposition WRITE_TRUNCATE.
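
The three steps above can be tried out locally with a rough sketch like the one below. It uses Python's built-in sqlite3 module as a stand-in for BigQuery (an assumption made so the example runs without a cloud project, and it needs an SQLite build with window-function support, 3.25 or later). The table names, the updated_at column, and the materialize helper are hypothetical, and the DELETE followed by INSERT only approximates what WRITE_TRUNCATE does to a destination table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_raw (packageid TEXT, status TEXT, updated_at INTEGER);

-- Step 2: a view that returns only the newest row per packageid.
CREATE VIEW table_latest AS
SELECT packageid, status, updated_at
FROM (
  SELECT *, MAX(updated_at) OVER (PARTITION BY packageid) AS max_timestamp
  FROM table_raw
)
WHERE updated_at = max_timestamp;

-- Destination for the materialized result (stand-in for the final table).
CREATE TABLE table_snapshot (packageid TEXT, status TEXT, updated_at INTEGER);
""")

# Step 1: every update is appended to the raw table as a new row.
conn.executemany(
    "INSERT INTO table_raw VALUES (?, ?, ?)",
    [
        ("pkg-1", "created", 1),
        ("pkg-2", "created", 2),
        ("pkg-2", "in_transit", 3),
        ("pkg-1", "delivered", 4),
    ],
)


def materialize(connection: sqlite3.Connection) -> None:
    # Step 3: once the raw data is stable, replace the snapshot contents
    # with the view's result in one transaction (truncate, then write).
    with connection:
        connection.execute("DELETE FROM table_snapshot")
        connection.execute("INSERT INTO table_snapshot SELECT * FROM table_latest")


materialize(conn)
snapshot = conn.execute(
    "SELECT packageid, status FROM table_snapshot ORDER BY packageid"
).fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM table_raw").fetchone()[0]
print(snapshot, raw_count)
```

After materializing, readers can query the snapshot directly and skip the window function, which is where the query-efficiency gain mentioned above would come from.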

Unfortunately, this does add a bit of overhead. That said, for your use case you may be able to just leave the view in place indefinitely, which would simplify things a bit.
