BigQuery中的行级原子MERGE REPLACE [英] Row level atomic MERGE REPLACE in BigQuery

查看:112
本文介绍了BigQuery中的行级原子MERGE REPLACE的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

对于我的用例,我正在使用可通过源处的唯一键标识的数据分解为n个(不确定的)数量的目标条目,这些条目已加载到BigQuery表中以进行分析.

For my use case I'm working with data identifiable by unique key at the source exploded into n (non deterministic) number of target entries loaded into BigQuery tables for analytic purposes.

构建此ETL以使用Mongo最近的Change Stream功能,我想删除BigQuery中的所有条目,然后自动加载新条目.

Building this ETL to use Mongo recent Change Stream feature I would like to drop all entries in BigQuery and then load the new entries atomically.

探索BigQuery DML,我看到了 MERGE 操作,但只能使用WHEN MATCHED THEN DELETEWHEN MATCHED THEN UPDATE.

Exploring BigQuery DML I see a MERGE operation is supported, but only WHEN MATCHED THEN DELETE or WHEN MATCHED THEN UPDATE is possible.

我对何时匹配然后删除并按照INSERT 操作感兴趣.

我如何在BigQuery中实现这样的ETL,同时在数据可用性和正确性方面保持原子性或最终一致性.

How would I implement such ETL in BigQuery while remaining atomic or eventually consistent as possible in terms of data availability and correctness.

我想提供一个具体的例子来详细说明.

EDIT 1: I would like to provide a concrete example to elaborate.

我在此数据集上具有的最小唯一性粒度为user_id.行不是唯一可识别的.

The lowest granularity of uniqueness I have on this dataset is user_id. Rows are not uniquely identifiable.

示例

从mongo更改流收到的更新的用户对象:

Updated user object received from mongo change stream:

user={_id: "3", name="max", registered="2018-07-05" q=["a", "b", "c"]}

2.

当前拥有BigQuery.user_q

2.

Current BigQuery.user_q holds

| user_id | q |
...
|    3    | a |
|    3    | b |
...

3.

转换代码将修改后的用户对象加载到BigQuery.user_q_incoming

3.

Transform code loads modified user object into BigQuery.user_q_incoming

| user_id | q |
|    3    | a |
|    3    | b |
|    3    | c |

4.

user_quser_q_incoming之间的合并:

4.

MERGE between user_q and user_q_incoming:

    user_q中属于user_id 3
  1. 2行已删除
  2. 插入了
  3. 属于user_id 3user_q_incoming中的3行.
  4. user_q中的其余数据(...)保留在原处,未经修改.
  1. 2 rows in user_q that belong to user_id 3 are DELETED
  2. 3 rows in user_q_incoming that belong to user_id 3 are INSERTED.
  3. Rest of the data (...) in user_q is left in place, unmodified.

5.

BigQuery.user_q持有

5.

BigQuery.user_q holds

| user_id | q |
...
|    3    | a |
|    3    | b |
|    3    | c |
...

例如,用户可以从个人资料中删除问题.将其余行保留为q=["a", "c"].我还需要将其转换为BigQuery世界视图.

For example user might delete a question from his profile. Leaving the remaining rows to be q=["a", "c"]. I need this to translate into the BigQuery world view as well.

推荐答案

有一个类似的问题和一个解决方法,可以使MERGE工作(

There is a similar question and one walk-around to make MERGE work (https://issuetracker.google.com/issues/35905927#comment9).

基本上,下面的事情应该起作用,

Basically, something like following should work,

MERGE `project.dataset.user_q` T
USING (
  SELECT *, false AS is_insert FROM `project.dataset.user_q_incoming`
UNION ALL
  SELECT *, true AS is_insert FROM `project.dataset.user_q_incoming`
) S
ON T.user_id = S.user_id and NOT is_insert
WHEN MATCHED THEN
  DELETE
WHEN NOT MATCHED AND is_insert THEN
  INSERT(user_id, q) VALUES(user_id, q)

理想情况下,以下是您所需要的,但尚不支持.

Ideally, following is what you need, but it's not supported yet.

MERGE `project.dataset.user_q`
USING `project.dataset.user_q_incoming`
ON FALSE
WHEN NOT MATCHED BY TARGET THEN
  INSERT(user_id, q) VALUES(user_id, q)
WHEN NOT MATCHED BY SOURCE AND user_id in (SELECT user_id FROM `project.dataset.user_q_incoming`) THEN
  DELETE

这篇关于BigQuery中的行级原子MERGE REPLACE的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆