Delete/update table entries by joining 2 tables on Google BigQuery without import/export


Problem Description

We have a use case where we have hundreds of millions of entries in a table and have a problem splitting it up further. 99% of operations are append-only. However, we occasionally do updates and deletes, which Google itself says are only possible by deleting the table and creating a new one with the latest data.

Because it's a lot of data and we would wish to update the tables within 30 seconds or so, we thought about the possibility of joining an Original table with a Refresher table in such a way that we keep only entries that appear in the Original table but not in the Refresher table (the delete case), or write items with the data from the Refresher table if found (the update case). The output/target would be a New table, which we would then copy back to the Original table with WRITE_TRUNCATE (overwrite). If the update case seems too complex, we could live with delete-only logic and re-insert the updated items ourselves.

Is this possible? What type of join seems to be the best fit? We'd stream-insert our updates into the Refresher table and periodically clean up the Original table. We would not have to pay for re-inserting the whole Original table (whether in time or money), but only for querying once and for those few streaming inserts into the update table.
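The join the question sketches can be written directly. A minimal sketch in legacy SQL, assuming hypothetical tables [db.original] and [db.refresher] that share a key column `id` (all names here are illustrative, not from the question):

```sql
-- Delete case: keep only original rows that have no counterpart in the
-- refresher table (a LEFT OUTER JOIN filtered on a NULL right-hand key).
SELECT o.*
FROM [db.original] AS o
LEFT OUTER JOIN EACH [db.refresher] AS r
  ON o.id = r.id
WHERE r.id IS NULL
-- Update case: the refresher rows themselves are the updated versions, so
-- the result of this query unioned with SELECT * FROM [db.refresher] forms
-- the New table, which can then be written back over [db.original] by
-- running the job with that destination table and WRITE_TRUNCATE.
```

Note that `JOIN EACH` was the legacy-SQL hint required for joins between two large tables; the overall cost is one full scan of both tables per cleanup run.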

We can live with queries hitting stale data until the periodic merge happens. We could also pause queries entirely during a maintenance window.

Any ideas are welcome.

Recommended Answer

So, to add more to my comment:

Why don't you just accept the updates as a new row in your table, and have queries that read only the last row from the table? That's much easier.

Create a view like this:

select * from (
  SELECT
    rank() over (partition by user_id order by timestamp desc) as _rank,
    *
  FROM [db.userupdate]
) where _rank=1

and update your queries to query the view (instead of the base table) wherever you want only the latest rows, and you are done.

Some context on how we use this. We have an events table that holds user profile data. On every update we append the complete profile data row again in BQ. That means we end up with versioned content: as many rows for a user_id as the number of updates they have made. This is all in the same table, and by looking at the timestamp we know the order of the updates. Let's say the table is [userupdate]. If we do a

select * from userupdate where user_id=10

it will return all updates made by this user to their profile in random order.

But we created a view (only once; the syntax is above), and now when we run:

select * from userupdate_last where user_id=10 #notice the table name changed to view name

it will return only 1 row: the last row for that user. And when we want only the latest version of each row from a table holding a bunch of append-only rows, we just swap the table name for the view name in our queries.
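The questioner's periodic cleanup still fits this pattern: the same window query can be run as a batch job whose destination table is the base table itself, with write disposition WRITE_TRUNCATE, so superseded rows are physically dropped. A hedged sketch (the job configuration is described in comments; only the query text is SQL):

```sql
-- Run this query with destination table [db.userupdate] and
-- writeDisposition=WRITE_TRUNCATE to compact the append-only table down to
-- one latest row per user_id. In practice the column list would be spelled
-- out explicitly here, since the helper _rank column should not be written
-- back into the compacted table.
SELECT * FROM (
  SELECT
    rank() over (partition by user_id order by timestamp desc) as _rank,
    *
  FROM [db.userupdate]
) WHERE _rank=1
```

Between compaction runs, queries through the view stay correct; the compaction only reduces storage and scan cost.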
