Is there any other approach for updating a row in BigQuery apart from overwriting the table?

Question

I have package data with fields such as the following:

packageid --> string
status --> string
status_type --> string
scans --> record (repeated)
     scanid --> string
     status --> string
scannedby --> string
Each day I receive data for 100 000 packages. The total package data comes to approximately 100 MB per day, or about 3 GB per month. Each package can receive 3-4 updates. Do I have to overwrite the package table every time a package update arrives (e.g. just a change in the status field)?

Suppose the table holds data for 3 packages and an update for the 2nd package arrives. Do I have to overwrite the whole table (deleting and re-adding all the data takes 2 transactions per package update)? For 100 000 packages, the total number of transactions would be 10^5 * 10^5 * 2/2.

Is there any other approach for atomic updates that does not overwrite the table? (If the table contains 1 million entries and a single package update arrives, overwriting the whole table is a lot of overhead.)

Recommended answer

Currently there is no way to update individual rows. We do see this use case somewhat often, and we recommend something similar to what Mikhail suggested. Basically, if you have some unique ID for a logical row, and a timestamp of the update time to the row data, you can simply add every update as a new row, and apply a view over the table to give you the desired rows.

Your view would look like this:

SELECT *
FROM (
  SELECT
      *,
      MAX(<timestamp_column>)
          OVER (PARTITION BY <id_column>)
          AS max_timestamp,
  FROM <table>
)
WHERE <timestamp_column> = max_timestamp

(Taken from "Return only the newest rows from a BigQuery table with duplicate items".)
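To make the append-and-view pattern concrete, here is a minimal sketch that runs the same query shape locally. It uses Python's built-in sqlite3 module purely as a stand-in for BigQuery (an assumption made so the example is runnable; window functions need SQLite 3.25 or later). The names package_raw, package, packageid, status, and updated_at are hypothetical and stand in for <table>, <id_column>, and <timestamp_column>.

```python
import sqlite3


def latest_rows(updates):
    """Append every update as a new row, then read the newest row per id."""
    conn = sqlite3.connect(":memory:")  # local stand-in, not BigQuery
    conn.execute(
        "CREATE TABLE package_raw "
        "(packageid TEXT, status TEXT, updated_at INTEGER)"
    )
    # Append-only: each update is a fresh row; nothing is modified in place.
    conn.executemany("INSERT INTO package_raw VALUES (?, ?, ?)", updates)
    # Same shape as the view above: keep only rows whose timestamp equals
    # the maximum timestamp for their id.
    conn.execute(
        """
        CREATE VIEW package AS
        SELECT packageid, status, updated_at
        FROM (
          SELECT *,
                 MAX(updated_at) OVER (PARTITION BY packageid) AS max_timestamp
          FROM package_raw
        )
        WHERE updated_at = max_timestamp
        """
    )
    rows = conn.execute(
        "SELECT packageid, status, updated_at FROM package ORDER BY packageid"
    ).fetchall()
    conn.close()
    return rows


# Hypothetical updates: (packageid, status, updated_at)
updates = [
    ("p1", "created", 1),
    ("p2", "created", 1),
    ("p1", "in_transit", 2),
    ("p2", "in_transit", 2),
    ("p2", "delivered", 3),
]
result = latest_rows(updates)
print(result)
```

Note that if two rows for the same id share the maximum timestamp, this query returns both, so the timestamp needs to be fine-grained enough to be unique per id.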

If your table is partitioned into daily tables (or becomes static after some period), you can then replace the view with the result of the view query after the table stabilizes, and improve your query efficiency.

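For example:

  • Add data to TABLE_RAW.
  • Create a view TABLE that runs the above query over TABLE_RAW.
  • At some point after TABLE_RAW has stabilized, query TABLE with a destination table of TABLE and a write disposition of WRITE_TRUNCATE.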
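The three steps above can be sketched locally as follows. This is only an illustration under stated assumptions: sqlite3 stands in for BigQuery, and the "query with destination table and WRITE_TRUNCATE" step is imitated by copying the view's result into a table and swapping it in under the view's name. The names package_raw, package, package_tmp, and materialize are hypothetical.

```python
import sqlite3

# Latest-row-per-id query, same shape as the view query in the answer.
VIEW_SQL = """
SELECT packageid, status, updated_at
FROM (
  SELECT *, MAX(updated_at) OVER (PARTITION BY packageid) AS max_timestamp
  FROM package_raw
)
WHERE updated_at = max_timestamp
"""


def materialize(conn):
    """Replace the view `package` with a plain table holding its result."""
    conn.execute("CREATE TABLE package_tmp AS SELECT * FROM package")
    conn.execute("DROP VIEW package")
    conn.execute("ALTER TABLE package_tmp RENAME TO package")


conn = sqlite3.connect(":memory:")  # local stand-in, not BigQuery

# Step 1: add data to the raw, append-only table.
conn.execute(
    "CREATE TABLE package_raw (packageid TEXT, status TEXT, updated_at INTEGER)"
)
conn.executemany(
    "INSERT INTO package_raw VALUES (?, ?, ?)",
    [("p1", "created", 1), ("p1", "in_transit", 2), ("p2", "delivered", 3)],
)

# Step 2: create the view over the raw table.
conn.execute("CREATE VIEW package AS " + VIEW_SQL)

# Step 3: once the raw data has stabilized, swap the view for its result.
materialize(conn)

kind = conn.execute(
    "SELECT type FROM sqlite_master WHERE name = 'package'"
).fetchone()[0]
rows = conn.execute("SELECT * FROM package ORDER BY packageid").fetchall()
print(kind, rows)
```

After the swap, readers of package no longer pay for the window function on every query, which is the efficiency gain the answer describes.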

Unfortunately, this does add a bit of overhead. That said, for your use case you may be able to just leave the view in place indefinitely, which would simplify things a bit.
