Avoid duplicates in BigQuery


Question

I'm working with BigQuery, and the documentation says:

Unlike a traditional RDBMS, there is no notion of primary/secondary or row-id keys. If required, identify a column in the table schema for that purpose.

Do you know how I could insert rows without duplicates, the way a primary key would enforce it (and not only within a single insert)? Regards

Solution

Let's get some facts straight first, because you cannot make BigQuery reject duplicates on insert.

BigQuery is a managed data warehouse suited to large datasets. It is complementary to a traditional database, not a replacement for one.

At the time of writing, you can only run a maximum of 96 DML (UPDATE, DELETE) operations on a table per day (this quota has been relaxed since, so check the current quotas page). This is by design: the limit is low because it forces you to think of BQ as a data lake.

So in BigQuery you let every row in; everything is append-only by design. That means you end up with a table that holds a new row for every update. If you want the latest data, you pick the most recent row for each entity and use that.
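To make the append-only idea concrete, here is a minimal sketch in plain Python rather than BigQuery itself. The field names (`user_id`, `country`, `updated_at`) and the sample rows are made up for illustration; the point is only that updates are stored as extra rows and the "current" value is derived at read time.

```python
from datetime import datetime

# Hypothetical append-only log: every update to a user is a new row,
# nothing is ever overwritten.
rows = [
    {"user_id": 1, "country": "FR", "updated_at": datetime(2018, 1, 1, 12, 0, 0)},
    {"user_id": 2, "country": "US", "updated_at": datetime(2018, 1, 1, 12, 5, 0)},
    {"user_id": 1, "country": "DE", "updated_at": datetime(2018, 1, 1, 12, 0, 40)},
]


def latest_rows(rows, key="user_id", ts="updated_at"):
    """Return the most recent row per key without touching the history."""
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        if current is None or row[ts] > current[ts]:
            latest[row[key]] = row
    return latest


current_state = latest_rows(rows)
print(current_state[1]["country"])  # DE
```

The history stays intact (all three rows are still there), which is what makes the kind of analysis described next possible.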

We actually get insights out of every new update we add for the same entity. For example, we can detect how long it took an end user to choose their country in the signup flow. We had a dropdown of countries, and it took some time to scroll to the right one. The metrics showed this because we ended up with two rows in BQ, one from before the country was selected and one from after, and based on the time between them we were able to optimize the process. Now our country dropdown lists the 5 most recent/frequent countries first, so those users no longer need to scroll to pick a country, and it's faster.
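As a rough illustration of that kind of measurement, the sketch below computes the time between a user's first and last row. The event layout (`user_id`, `country`, `ts`) and the values are hypothetical, not our real schema.

```python
from datetime import datetime

# Hypothetical signup events: one row before the country was picked
# (country is None) and one row after, per user.
events = [
    {"user_id": 1, "country": None, "ts": datetime(2018, 1, 1, 12, 0, 0)},
    {"user_id": 2, "country": None, "ts": datetime(2018, 1, 1, 12, 1, 0)},
    {"user_id": 1, "country": "RO", "ts": datetime(2018, 1, 1, 12, 0, 9)},
    {"user_id": 2, "country": "US", "ts": datetime(2018, 1, 1, 12, 1, 3)},
]


def seconds_to_choose(events):
    """Seconds between the earliest and latest row for each user."""
    first, last = {}, {}
    for event in events:
        uid, ts = event["user_id"], event["ts"]
        if uid not in first or ts < first[uid]:
            first[uid] = ts
        if uid not in last or ts > last[uid]:
            last[uid] = ts
    return {uid: (last[uid] - first[uid]).total_seconds() for uid in first}


durations = seconds_to_choose(events)
print(durations)  # {1: 9.0, 2: 3.0}
```

This only works because the earlier row was kept instead of being overwritten.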

In other words, you use the streaming insert functionality to keep adding new rows, and your SQL queries, usually with window functions, pick the last row.
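Here is a small sketch of the "pick the last row with a window function" query. It runs against an in-memory SQLite database from Python's standard library as a stand-in for BigQuery (it needs SQLite 3.25 or newer for window functions), and the table and column names are invented for the example. The `ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ... DESC)` pattern itself is also available in BigQuery standard SQL.

```python
import sqlite3

# In-memory SQLite stands in for BigQuery here; the table is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_updates (user_id INTEGER, country TEXT, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO user_updates VALUES (?, ?, ?)",
    [
        (1, "FR", "2018-01-01 12:00:00"),
        (1, "DE", "2018-01-01 12:00:40"),  # later update for the same user
        (2, "US", "2018-01-01 12:05:00"),
    ],
)

# Number each user's rows from newest to oldest, then keep only the newest.
LATEST_ROW_SQL = """
SELECT user_id, country, updated_at
FROM (
    SELECT user_id, country, updated_at,
           ROW_NUMBER() OVER (
               PARTITION BY user_id ORDER BY updated_at DESC
           ) AS rn
    FROM user_updates
) AS ranked
WHERE rn = 1
ORDER BY user_id
"""

latest = conn.execute(LATEST_ROW_SQL).fetchall()
print(latest)
# [(1, 'DE', '2018-01-01 12:00:40'), (2, 'US', '2018-01-01 12:05:00')]
```

You would typically wrap a query like this in a view, so consumers always see one row per key while the raw table keeps the full history.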

You cannot routinely update a row or append to a record in place, because BigQuery limits DML statements to 96 per table per day.
