How can I avoid and/or clean duplicated rows in BigQuery?
Question
How should I import data into BigQuery on a daily basis when I have potentially duplicated rows?
Here is a bit of context. I'm updating data on a daily basis from a spreadsheet to BigQuery, using Google Apps Script with a simple WRITE_APPEND method.
Sometimes I'm importing data I've already imported the day before, so I'm wondering how I can avoid this.
Can I build a SQL query to clean duplicate rows from my table every day? Or is it possible to detect duplicates even before importing them (with some specific option in my job definition, for example)?
Thanks!
Answer
- Step 1: Have a sheet with data to be imported
- Step 2: Set up your spreadsheet as a federated data source in BigQuery.
- Step 3: Use DML to load data into an existing table
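For step 2, one way to register a sheet as a federated (external) table is BigQuery's DDL. This is only a sketch: the spreadsheet URL is a placeholder, and the column list assumes the table has just `id` and `data` as in the example below.

```sql
#standardSQL
-- Sketch: define a Google Sheets-backed external table.
-- The docs.google.com URL is a placeholder; use your own sheet's URL.
CREATE EXTERNAL TABLE `fh-bigquery.tt.test_import_sheet` (
  id INT64,
  data STRING
)
OPTIONS (
  format = 'GOOGLE_SHEETS',
  uris = ['https://docs.google.com/spreadsheets/d/...'],
  sheet_range = 'Sheet1',      -- which tab to read
  skip_leading_rows = 1        -- skip the header row
);
```

You can also set this up through the BigQuery UI by creating a table with "Drive" as the source and "Google Sheets" as the file format.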
(requires #standardSQL)
#standardSQL
INSERT INTO `fh-bigquery.tt.test_import_native` (id, data)
SELECT *
FROM `fh-bigquery.tt.test_import_sheet`
WHERE id NOT IN (
SELECT id
FROM `fh-bigquery.tt.test_import_native`
)
WHERE id NOT IN (...) ensures that only rows with new ids are loaded into the table.
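If duplicates have already made it into the table, you can also run a periodic cleanup. A common pattern (a sketch, assuming `id` alone identifies a duplicate and that rewriting the whole table is acceptable) keeps one arbitrary row per id:

```sql
#standardSQL
-- Sketch: rewrite the table keeping a single row per id.
CREATE OR REPLACE TABLE `fh-bigquery.tt.test_import_native` AS
SELECT * EXCEPT(rn)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY id) AS rn  -- number rows within each id
  FROM `fh-bigquery.tt.test_import_native`
)
WHERE rn = 1;  -- keep only the first row of each id group
```

Note that `CREATE OR REPLACE TABLE` drops and recreates the table, so table-level settings such as descriptions or expiration may need to be reapplied.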