Avoid duplicates in BigQuery

Question

I'm working with BigQuery, and the documentation says:

Unlike a traditional RDBMS, there is no notion of primary/secondary or row-id keys. If required, identify a column in the table schema for that purpose.

Do you know how I could insert without duplicates, the way a primary key would enforce it (and not only within the same insert)? Regards

Recommended answer

Let's clear up some facts first, because you cannot insert without duplicates: BigQuery does not enforce uniqueness at insert time.

BigQuery is a managed data warehouse suitable for large datasets. It is complementary to traditional databases, not a replacement for them.

At the time of writing, you can run a maximum of 96 DML (update, delete) operations on a table per day. This is by design. The limit is low because it forces you to think of BQ as a data lake.

So in BigQuery you let all data in; everything is append-only by design. That means you end up with a table that holds a new row for every update. Hence, if you want to use the latest data, you need to pick the last row for each record and use that.

We actually get insights from every new update we add for the same record. For example, we can detect how long it took an end user to choose their country in the signup flow. Because we have a dropdown of countries, it took some time for the user to scroll to their country, and the metrics showed this: we ended up in BQ with two rows, one before the country was selected and one after, and based on the time between them we were able to optimize the process. Now our country dropdown lists the 5 most recent/frequent countries first, so those users no longer need to scroll to pick a country, which is faster.
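As a rough illustration of the measurement described above, here is a minimal Python sketch that computes the time between the "before" and "after" rows for each user. The row layout (`user_id`, `country`, `ts`) and the sample values are assumptions made up for this example, not the schema used in the answer.

```python
from datetime import datetime

# Hypothetical append-only rows: every change to the signup form is a new row.
rows = [
    {"user_id": "u1", "country": None, "ts": datetime(2017, 1, 1, 12, 0, 0)},
    {"user_id": "u1", "country": "RO", "ts": datetime(2017, 1, 1, 12, 0, 9)},
    {"user_id": "u2", "country": None, "ts": datetime(2017, 1, 1, 12, 5, 0)},
    {"user_id": "u2", "country": "DE", "ts": datetime(2017, 1, 1, 12, 5, 4)},
]


def seconds_to_choose_country(rows):
    """Map each user to the seconds between their first row without a
    country and their first row with one."""
    before, after = {}, {}
    for row in sorted(rows, key=lambda r: r["ts"]):
        uid = row["user_id"]
        if row["country"] is None:
            before.setdefault(uid, row["ts"])  # keep the earliest "before" row
        else:
            after.setdefault(uid, row["ts"])   # keep the earliest "after" row
    return {
        uid: (after[uid] - before[uid]).total_seconds()
        for uid in before
        if uid in after
    }


durations = seconds_to_choose_country(rows)
```

In a real setup this comparison would more likely be done in SQL over the BigQuery table; the Python version only shows the idea of deriving a metric from the extra rows.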

In other words, you use the streaming insert functionality to constantly add new rows. Then your SQL queries, usually with window functions, pick the last row for each record.
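A minimal sketch of the "pick the last row" query, assuming a hypothetical `user_events` table with `user_id`, `country` and `updated_at` columns. To keep the example runnable without a BigQuery project, it uses Python's built-in `sqlite3` (SQLite 3.25 or newer is needed for window functions) as a stand-in; the `ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)` pattern itself is also available in BigQuery.

```python
import sqlite3

# In-memory SQLite database standing in for a BigQuery table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_events (user_id TEXT, country TEXT, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO user_events VALUES (?, ?, ?)",
    [
        ("u1", None, "2017-01-01 12:00:00"),
        ("u1", "RO", "2017-01-01 12:00:09"),  # later update for u1
        ("u2", "DE", "2017-01-01 12:05:04"),
        ("u2", "DE", "2017-01-01 12:05:04"),  # exact duplicate insert
    ],
)

# Rank the rows of each user from newest to oldest and keep only rank 1.
LATEST_ROW_SQL = """
SELECT user_id, country, updated_at
FROM (
    SELECT
        user_id,
        country,
        updated_at,
        ROW_NUMBER() OVER (
            PARTITION BY user_id
            ORDER BY updated_at DESC
        ) AS rn
    FROM user_events
) AS ranked
WHERE rn = 1
ORDER BY user_id
"""

latest = conn.execute(LATEST_ROW_SQL).fetchall()
```

Because `ROW_NUMBER()` gives every row in a partition a distinct rank, exact duplicates collapse to a single row as well, so the same query handles both repeated inserts and successive updates.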

You cannot freely update a row or append to a record, because BigQuery limits DML statements to 96 per table per day.
