BigQuery - removing duplicate records sometimes taking long

Question

We implemented the following ETL process in the cloud: run a query against our local database hourly => save the result as CSV and upload it to Cloud Storage => load the file from Cloud Storage into a BigQuery table => remove duplicate records using the following query.

SELECT 
  * EXCEPT (row_number)
FROM (
  SELECT 
    *,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY timestamp DESC) row_number 
  FROM rawData.stock_movement
)
WHERE row_number = 1

Since 8 am this morning (local time in Berlin), removing duplicate records has been taking much longer than it usually does, even though the amount of data is not much different from usual: it usually takes about 10 s, whereas this morning it sometimes took half an hour.

Is the performance of removing duplicate records unstable?

Recommended answer

It could be that you have many duplicate values for a particular id, so computing row numbers takes a long time. To check whether this is the case, you can try:

#standardSQL
SELECT id, COUNT(*) AS id_count
FROM rawData.stock_movement
GROUP BY id
ORDER BY id_count DESC LIMIT 5;

With that said, it may be faster to remove duplicates with this query instead:

#standardSQL
SELECT latest_row.*
FROM (
  SELECT ARRAY_AGG(t ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)] AS latest_row
  FROM rawData.stock_movement AS t
  GROUP BY t.id
);

Here is an example:

#standardSQL
WITH T AS (
  SELECT 1 AS id, 'foo' AS x, TIMESTAMP '2017-04-01' AS timestamp UNION ALL
  SELECT 2, 'bar', TIMESTAMP '2017-04-02' UNION ALL
  SELECT 1, 'baz', TIMESTAMP '2017-04-03')
SELECT latest_row.*
FROM (
  SELECT ARRAY_AGG(t ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)] AS latest_row
  FROM T AS t
  GROUP BY t.id
);

The reason this may be faster is that, for each id, BigQuery only needs to keep the row with the largest timestamp in memory at any given point in time.
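
One way to picture the difference is the plain-Python sketch below. It is a simplified model under stated assumptions, not a description of how BigQuery actually executes either query: the sample rows and the helper names `dedup_by_row_number` and `dedup_by_running_max` are hypothetical and exist only for illustration. The first function buffers every row of an id before ranking them, the way a row-number computation conceptually needs the whole partition; the second retains a single row per id while scanning.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample rows as (id, x, timestamp) tuples, for illustration only.
ROWS = [
    (1, "foo", datetime(2017, 4, 1)),
    (2, "bar", datetime(2017, 4, 2)),
    (1, "baz", datetime(2017, 4, 3)),
]


def dedup_by_row_number(rows):
    """Model of the ROW_NUMBER() approach: buffer all rows of an id, sort, keep rank 1."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[0]].append(row)  # every duplicate of an id is held at once
    # Largest number of rows buffered for a single id (grows with duplicates).
    peak = max((len(part) for part in partitions.values()), default=0)
    latest = {}
    for key, part in partitions.items():
        part.sort(key=lambda r: r[2], reverse=True)  # newest timestamp first
        latest[key] = part[0]
    return latest, peak


def dedup_by_running_max(rows):
    """Model of the ARRAY_AGG(... LIMIT 1) approach: retain one row per id."""
    latest = {}
    for row in rows:
        current = latest.get(row[0])
        if current is None or row[2] > current[2]:
            latest[row[0]] = row  # replace the single retained row for this id
    return latest


by_rank, peak_rows_per_id = dedup_by_row_number(ROWS)
by_max = dedup_by_running_max(ROWS)
print(by_rank == by_max)  # both models select the same rows
```

In this model, the buffer size of the first function grows with the number of duplicates of the most frequent id, which matches the suspicion above that a heavily duplicated id slows down the row-number query, while the second function's state stays at one row per id regardless of how many duplicates arrive.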
