Google BQ - how to upsert existing data in tables?


Problem description


I'm using the Python client library to load data into BigQuery tables. I need to update some changed rows in those tables, but I can't figure out how to update them correctly. I want something like an UPSERT: insert the row only if it doesn't exist, otherwise update the existing row.

Is it the right approach to add a special checksum field to the tables (and compare checksums during loading)? If that is a good idea, how can I do it with the Python client? (As far as I know, it can't update existing data.)

Please explain what the best practice is.

Solution

BigQuery is, by design, append-only preferred. That means you are better off keeping duplicate rows for the same entity in the table and writing your queries so that they always read the most recent row.
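As a rough illustration of "always read the most recent row", the sketch below builds a standard-SQL query that keeps one row per entity using `ROW_NUMBER()`. The table name `my_project.my_dataset.users` and the columns `user_id` and `updated_at` are assumptions made for this example; they do not come from the question.

```python
def latest_rows_sql(table, key_column, ts_column):
    """Build a query that returns only the newest row for each key.

    All three arguments are hypothetical names supplied by the caller.
    """
    return (
        "SELECT * EXCEPT(rn) FROM (\n"
        "  SELECT *, ROW_NUMBER() OVER (\n"
        f"    PARTITION BY {key_column} ORDER BY {ts_column} DESC\n"
        "  ) AS rn\n"
        f"  FROM `{table}`\n"
        ")\n"
        "WHERE rn = 1"
    )


# Assumed table and column names, for illustration only.
sql = latest_rows_sql("my_project.my_dataset.users", "user_id", "updated_at")
print(sql)
```

The resulting string could be passed to `Client.query()` from the `google-cloud-bigquery` package; that call is left out here so the snippet runs without credentials.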

Updating rows the way you would in a transactional database is not practical in BQ. You have only 100 updates per table per day (the limit at the time of writing). That is very restrictive, and the update feature is meant for a totally different purpose.

Since BQ is used as a data lake, you should just stream a new row every time the user, for example, updates their profile. After 20 saves you will end up with 20 rows for the same user. Later you can rematerialize your table to have unique rows by removing the duplicate data.
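The append-then-deduplicate idea can be sketched in plain Python. This is only an in-memory model of the pattern, not BigQuery code: `stamp_row` and `dedupe_latest` are hypothetical helpers, and the `user_id` and `updated_at` field names are assumptions.

```python
from datetime import datetime, timezone


def stamp_row(row, now=None):
    """Return a copy of the row with an updated_at timestamp added."""
    now = now or datetime.now(timezone.utc)
    # Fixed-width ISO strings (always with microseconds) sort chronologically.
    return {**row, "updated_at": now.isoformat(timespec="microseconds")}


def dedupe_latest(rows, key="user_id", ts="updated_at"):
    """Keep only the newest row per key, as a rematerialization would."""
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        if current is None or row[ts] > current[ts]:
            latest[row[key]] = row
    return list(latest.values())


# Every save appends a new row; nothing is updated in place.
history = [
    stamp_row({"user_id": 1, "name": "old"}, datetime(2024, 1, 1, tzinfo=timezone.utc)),
    stamp_row({"user_id": 1, "name": "new"}, datetime(2024, 1, 2, tzinfo=timezone.utc)),
    stamp_row({"user_id": 2, "name": "other"}, datetime(2024, 1, 1, tzinfo=timezone.utc)),
]
unique_rows = dedupe_latest(history)
```

In a real pipeline the stamped rows would be streamed to the table, for instance with `Client.insert_rows_json()` from `google-cloud-bigquery`, and the deduplication would run as a query inside BigQuery rather than in Python.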

For the latter, see the most relevant question: BigQuery - DELETE statement to remove duplicates
