Delete/update table entries by joining 2 tables on Google BigQuery without import/export


Problem description


We have a use case with hundreds of millions of entries in a single table, and we have trouble splitting it up any further. 99% of operations are append-only. However, we occasionally need updates and deletes, which, according to Google itself, are only possible by deleting the table and creating a new one with the latest data.

Because it's a lot of data and we would like to update the tables within 30 seconds or so, we thought about joining an Original Table with a Refresher Table so that the result keeps only entries that appear in the Original Table but not in the Refresher Table (the delete case), or takes the data from the Refresher Table where a match is found (the update case). The output/target would be a New Table, which we would then copy back to the Original Table with WRITE_TRUNCATE (overwrite). If updates turn out to be too complex, we could live with delete-only logic and re-insert the updated items ourselves.
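The merge described above can be read as an anti-join (keep Original rows with no match in the Refresher Table) plus a union with the refreshed rows. The following is only a minimal sketch of that logic, using Python's built-in sqlite3 as a stand-in for BigQuery. The table names `original`, `refresher`, and `new_table`, the `id` and `value` columns, and the `deleted` flag that tells deletes from updates are all assumptions made for illustration; none of them come from the question.

```python
import sqlite3

# In-memory SQLite stands in for BigQuery; all names and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE original  (id INTEGER, value TEXT);
CREATE TABLE refresher (id INTEGER, value TEXT, deleted INTEGER);

INSERT INTO original  VALUES (1, 'a'), (2, 'b'), (3, 'c');
-- id 2 is updated, id 3 is deleted (assumed 'deleted' flag).
INSERT INTO refresher VALUES (2, 'b-updated', 0), (3, NULL, 1);

-- Anti-join: keep Original rows that have no entry in Refresher,
-- then append the Refresher rows that are updates rather than deletes.
CREATE TABLE new_table AS
SELECT o.id, o.value
FROM original AS o
LEFT JOIN refresher AS r ON r.id = o.id
WHERE r.id IS NULL
UNION ALL
SELECT id, value FROM refresher WHERE deleted = 0;
""")

merged = conn.execute(
    "SELECT id, value FROM new_table ORDER BY id"
).fetchall()
print(merged)
```

In this sketch `new_table` plays the role of the New Table that would then overwrite the Original Table. With delete-only logic, the second half of the `UNION ALL` would be dropped and the updated rows re-inserted separately.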

Is this possible? What type of join seems to be the best fit? We'd stream-insert our updates into the Refresher Table and periodically clean up the Original Table. That way we would not pay (in time or money) for re-inserting the whole Original Table, only for querying it once and for those few streaming inserts into the Refresher Table.

EDIT: We can live with querying stale data until the periodic merge takes place. We can also halt queries for a short time during maintenance.

Any thoughts welcome.

Solution

To expand on my comment:

Why don't you just accept the updates as a new row in your table, and have queries that read only the last row from the table? That's much easier.

Create a view like this:

SELECT * FROM (
  SELECT
    rank() OVER (PARTITION BY user_id ORDER BY timestamp DESC) AS _rank,
    *
  FROM [db.userupdate]
) WHERE _rank = 1

Then update your queries to read from the view instead of the base table, and you are done.

Some context on how we use this: we have an events table that holds user profile data. On every update we append the complete profile data row again in BQ. That means we end up with versioned content, with as many rows for a given user_id as the number of updates that user has made. This is all in the same table, and the timestamp tells us the order of the updates. Let's say the table is [userupdate]. If we run

select * from userupdate where user_id=10

it will return all updates made by this user to their profile, in no particular order.

But we created a view, which we only had to create once, using the syntax above. Now when we run

select * from userupdate_last where user_id=10 # note: the table name is now the view name

it returns only one row, the user's most recent one. Whenever a query should see only the latest version from a table of append-only rows, we simply swap the table name for the view name.
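To try the "latest row per user" view without a BigQuery project, here is a minimal sketch that uses Python's built-in sqlite3 as a stand-in (it needs an SQLite build with window-function support, 3.25 or later). The table and view names follow the answer; the `name` column and the sample rows are made up for illustration, and SQLite's SQL dialect differs from BigQuery's, so treat this as a demonstration of the idea rather than BigQuery syntax.

```python
import sqlite3

# In-memory SQLite stands in for BigQuery; the sample rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE userupdate (user_id INTEGER, name TEXT, timestamp INTEGER);

-- Every profile update is appended as a full new row.
INSERT INTO userupdate VALUES
    (10, 'alice-v1', 100),
    (10, 'alice-v2', 200),
    (10, 'alice-v3', 300),
    (11, 'bob-v1',   150);

-- The view ranks rows per user_id, newest first, and keeps rank 1 only.
CREATE VIEW userupdate_last AS
SELECT user_id, name, timestamp FROM (
    SELECT
        rank() OVER (PARTITION BY user_id ORDER BY timestamp DESC) AS _rank,
        *
    FROM userupdate
) WHERE _rank = 1;
""")

# Querying the base table returns every version of the profile.
all_versions = conn.execute(
    "SELECT name FROM userupdate WHERE user_id = 10 ORDER BY timestamp"
).fetchall()

# Querying the view returns only the most recent version.
latest = conn.execute(
    "SELECT name, timestamp FROM userupdate_last WHERE user_id = 10"
).fetchall()

print(all_versions)
print(latest)
```

Note that `rank()` returns more than one row per user if two updates share the same timestamp; `row_number()` would be one way to force a single row in that case.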
