Reject data load attempt to BigQuery for existing data

Question

I'm loading data from pandas DataFrames to BigQuery using the pandas-gbq package:

df.to_gbq('dataset.table', project_id, reauth=False, if_exists='append')

A typical dataframe looks like:

key      |    value    |    order
"sd3e"   |     0.3     |    1
"sd3e"   |     0.2     |    2
"sd4r"   |     0.1     |    1
"sd4r"   |     0.5     |    2

Is there a way to reject the loading attempt if the key already appears in the BigQuery table?

Solution

Is there a way to reject the loading attempt if the key already appears in the BigQuery table?

No, because BigQuery doesn't enforce unique keys the way other databases do. There are two typical approaches to work around this:

Option 1:
Upload the data with a timestamp and use a MERGE statement to remove duplicates.

See this link on how to do this. This is an example:

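-- `DATA` is a placeholder for the table that holds the uploaded rows.
MERGE `DATA` AS target
USING `DATA` AS source
ON target.key = source.key
-- Delete every row for which a newer row with the same key exists.
WHEN MATCHED AND target.ts < source.ts THEN
DELETE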

Note: In this case, you pay for the data the MERGE scans, but the table keeps a single row per key.
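
Both options rely on every uploaded row carrying a load timestamp. A minimal sketch of adding one on the pandas side before the upload might look like the following. The column name ts matches the queries in this answer; the helper name append_with_timestamp and its injectable upload parameter are hypothetical and exist only so the example can be exercised without network access.

```python
from datetime import datetime, timezone


def append_with_timestamp(df, table, project_id, upload=None, ts_col="ts"):
    """Add a load-timestamp column to df, then append it to the table.

    `upload` defaults to pandas_gbq.to_gbq; pass a stand-in for offline testing.
    """
    if upload is None:
        import pandas_gbq  # imported lazily so the module loads without it

        upload = pandas_gbq.to_gbq

    # One timestamp per load: every row of this batch gets the same UTC value.
    stamped = df.assign(**{ts_col: datetime.now(timezone.utc)})
    upload(stamped, table, project_id=project_id, if_exists="append")
    return stamped
```

Called as append_with_timestamp(df, 'dataset.table', project_id), this would replace the plain df.to_gbq call from the question.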

Option 2:

Upload the data with a timestamp and use the ROW_NUMBER window function to fetch the latest record. This is an example with your data:

WITH DATA AS (
    SELECT 'sd3e' AS key, 0.3 as value,  1 as r_order, '2019-04-14 00:00:00' as ts  UNION ALL
    SELECT 'sd3e' AS key, 0.2 as value,  2 as r_order, '2019-04-14 01:00:00' as ts  UNION ALL
    SELECT 'sd4r' AS key, 0.1 as value,  1 as r_order, '2019-04-14 00:00:00' as ts  UNION ALL
    SELECT 'sd4r' AS key, 0.5 as value,  2 as r_order, '2019-04-14 01:00:00' as ts  
)

SELECT * 
FROM (
    SELECT * ,ROW_NUMBER() OVER(PARTITION BY key order by ts DESC) rn 
    FROM `DATA` 
)
WHERE rn = 1

This produces the expected result: one row per key, namely the row with the latest ts (for the sample data, the two rows with r_order = 2).

Note: This option doesn't incur extra charges for deduplication, but you must always apply the window function when reading from the table.
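
Neither option rejects the load itself. If a hard rejection is wanted, one possibility is a client-side check before calling to_gbq, sketched below. This is a different technique from the two options above, not something BigQuery enforces: it is not atomic (a concurrent load could slip in between the check and the append), and the query that fetches existing keys is billed like any other query. The helper names are hypothetical, the key column is assumed to be called key, and the table is assumed to be addressable as `dataset.table` in standard SQL.

```python
def conflicting_keys(new_keys, existing_keys):
    """Return the sorted keys that appear in both collections."""
    return sorted(set(new_keys) & set(existing_keys))


def append_unless_keys_exist(df, table, project_id, key_col="key"):
    """Append df to the table, or raise if any of its keys is already stored."""
    import pandas_gbq  # imported lazily; needs credentials and network access

    # Fetch the distinct keys currently in the table (assumes standard SQL).
    existing = pandas_gbq.read_gbq(
        f"SELECT DISTINCT {key_col} FROM `{table}`", project_id=project_id
    )
    clashes = conflicting_keys(df[key_col], existing[key_col])
    if clashes:
        raise ValueError(f"keys already present in {table}: {clashes}")
    pandas_gbq.to_gbq(df, table, project_id=project_id, if_exists="append")
```

For large tables, restricting the lookup to the keys in the new batch would scan the same column but return far fewer rows; that refinement is left out here to keep the sketch short.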
