BigQuery中的行级原子MERGE REPLACE [英] Row level atomic MERGE REPLACE in BigQuery
问题描述
对于我的用例,我正在使用可通过源处的唯一键标识的数据分解为n个(不确定的)数量的目标条目,这些条目已加载到BigQuery表中以进行分析.
For my use case I'm working with data identifiable by unique key at the source exploded into n (non deterministic) number of target entries loaded into BigQuery tables for analytic purposes.
构建此ETL以使用Mongo最近的Change Stream功能,我想删除BigQuery中的所有条目,然后自动加载新条目.
Building this ETL to use Mongo recent Change Stream feature I would like to drop all entries in BigQuery and then load the new entries atomically.
探索BigQuery DML,我看到了 MERGE 支持a>操作,但只能使用WHEN MATCHED THEN DELETE
或WHEN MATCHED THEN UPDATE
.
Exploring BigQuery DML I see a MERGE operation is supported, but only WHEN MATCHED THEN DELETE
or WHEN MATCHED THEN UPDATE
is possible.
我对何时匹配然后删除并按照INSERT 操作感兴趣.
我如何在BigQuery中实现这样的ETL,同时在数据可用性和正确性方面保持原子性或最终一致性.
How would I implement such ETL in BigQuery while remaining atomic or eventually consistent as possible in terms of data availability and correctness.
我想提供一个具体的例子来详细说明.
EDIT 1: I would like to provide a concrete example to elaborate.
我在此数据集上具有的最小唯一性粒度为user_id
.行不是唯一可识别的.
The lowest granularity of uniqueness I have on this dataset is user_id
. Rows are not uniquely identifiable.
示例
从mongo更改流收到的更新的用户对象:
Updated user object received from mongo change stream:
user={_id: "3", name="max", registered="2018-07-05" q=["a", "b", "c"]}
2.
当前拥有BigQuery.user_q
2.
Current BigQuery.user_q holds
| user_id | q |
...
| 3 | a |
| 3 | b |
...
3.
转换代码将修改后的用户对象加载到BigQuery.user_q_incoming
3.
Transform code loads modified user object into BigQuery.user_q_incoming
| user_id | q |
| 3 | a |
| 3 | b |
| 3 | c |
4.
user_q
和user_q_incoming
之间的合并:
4.
MERGE between user_q
and user_q_incoming
:
- 2行已删除 插入了
- 属于
user_id 3
的user_q_incoming
中的3行. -
user_q
中的其余数据(...
)保留在原处,未经修改.
user_q
中属于user_id 3
的- 2 rows in
user_q
that belong touser_id 3
are DELETED - 3 rows in
user_q_incoming
that belong touser_id 3
are INSERTED. - Rest of the data (
...
) inuser_q
is left in place, unmodified.
5.
BigQuery.user_q持有
5.
BigQuery.user_q holds
| user_id | q |
...
| 3 | a |
| 3 | b |
| 3 | c |
...
例如,用户可以从个人资料中删除问题.将其余行保留为q=["a", "c"]
.我还需要将其转换为BigQuery世界视图.
For example user might delete a question from his profile. Leaving the remaining rows to be q=["a", "c"]
. I need this to translate into the BigQuery world view as well.
推荐答案
There is a similar question and one walk-around to make MERGE work (https://issuetracker.google.com/issues/35905927#comment9).
基本上,下面的事情应该起作用,
Basically, something like following should work,
MERGE `project.dataset.user_q` T
USING (
SELECT *, false AS is_insert FROM `project.dataset.user_q_incoming`
UNION ALL
SELECT *, true AS is_insert FROM `project.dataset.user_q_incoming`
) S
ON T.user_id = S.user_id and NOT is_insert
WHEN MATCHED THEN
DELETE
WHEN NOT MATCHED AND is_insert THEN
INSERT(user_id, q) VALUES(user_id, q)
理想情况下,以下是您所需要的,但尚不支持.
Ideally, following is what you need, but it's not supported yet.
MERGE `project.dataset.user_q`
USING `project.dataset.user_q_incoming`
ON FALSE
WHEN NOT MATCHED BY TARGET THEN
INSERT(user_id, q) VALUES(user_id, q)
WHEN NOT MATCHED BY SOURCE AND user_id in (SELECT user_id FROM `project.dataset.user_q_incoming`) THEN
DELETE
这篇关于BigQuery中的行级原子MERGE REPLACE的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!