BigQuery - removing duplicate records sometimes taking long


Question

We implemented the following ETL process in the cloud: run a query in our local database hourly => save the result as CSV and load it into Cloud Storage => load the file from Cloud Storage into a BigQuery table => remove duplicate records using the following query.

SELECT 
  * EXCEPT (row_number)
FROM (
  SELECT 
    *,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY timestamp DESC) row_number 
  FROM rawData.stock_movement
)
WHERE row_number = 1
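As an illustrative analog (my sketch, not part of the question), the ROW_NUMBER-based dedup above can be mimicked in Python on a few hypothetical rows: sort by timestamp descending within each id and keep only the first row per id.

```python
from itertools import groupby

# Hypothetical stand-in rows for rawData.stock_movement.
rows = [
    {"id": 1, "x": "foo", "timestamp": "2017-04-01"},
    {"id": 2, "x": "bar", "timestamp": "2017-04-02"},
    {"id": 1, "x": "baz", "timestamp": "2017-04-03"},
]

# Mirror ROW_NUMBER() OVER (PARTITION BY id ORDER BY timestamp DESC)
# ... WHERE row_number = 1: sort by timestamp DESC, then stable-sort by
# id so each id group keeps its descending timestamp order, and take
# the first row of each group.
rows.sort(key=lambda r: r["timestamp"], reverse=True)
rows.sort(key=lambda r: r["id"])
deduped = [next(group) for _, group in groupby(rows, key=lambda r: r["id"])]
print([r["x"] for r in deduped])  # ['baz', 'bar']
```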

Since 8 am this morning (local time in Berlin), removing duplicate records has been taking much longer than it usually does, even though the amount of data is not much different from usual: it normally takes about 10 seconds to remove duplicate records, whereas this morning it sometimes took half an hour.



Is the performance of removing duplicate records unstable?

Solution

It could be that you have many duplicate values for a particular id, so computing row numbers takes a long time. If you want to check whether this is the case, you can try:

#standardSQL
SELECT id, COUNT(*) AS id_count
FROM rawData.stock_movement
GROUP BY id
ORDER BY id_count DESC LIMIT 5;
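As an illustrative analog (mine, with hypothetical sample data), the same skew check can be sketched in Python: count rows per id and list the heaviest keys, which is what the query above does server-side.

```python
from collections import Counter

# Hypothetical stand-in rows for rawData.stock_movement; in production
# the GROUP BY query above does this counting inside BigQuery.
rows = [
    {"id": 1, "timestamp": "2017-04-01"},
    {"id": 1, "timestamp": "2017-04-02"},
    {"id": 1, "timestamp": "2017-04-03"},
    {"id": 2, "timestamp": "2017-04-02"},
]

# Count rows per id and list the heaviest keys, mirroring
# GROUP BY id ORDER BY id_count DESC LIMIT 5.
id_counts = Counter(row["id"] for row in rows)
print(id_counts.most_common(5))  # [(1, 3), (2, 1)]
```

A heavily skewed id shows up at the top of this list; that is the partition where ROW_NUMBER() has the most work to do.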

With that said, it may be faster to remove duplicates with this query instead:

#standardSQL
SELECT latest_row.*
FROM (
  SELECT ARRAY_AGG(t ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)] AS latest_row
  FROM rawData.stock_movement AS t
  GROUP BY t.id
);

Here is an example:

#standardSQL
WITH T AS (
  SELECT 1 AS id, 'foo' AS x, TIMESTAMP '2017-04-01' AS timestamp UNION ALL
  SELECT 2, 'bar', TIMESTAMP '2017-04-02' UNION ALL
  SELECT 1, 'baz', TIMESTAMP '2017-04-03')
SELECT latest_row.*
FROM (
  SELECT ARRAY_AGG(t ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)] AS latest_row
  FROM T AS t
  GROUP BY t.id
);

The reason this may be faster is that BigQuery only needs to keep the row with the largest timestamp per id in memory at any given point in time.
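To make that memory argument concrete, here is a hedged single-pass sketch in Python (the names are mine, not from the answer): while scanning, it remembers only the latest row per id, rather than materializing and sorting every duplicate in a partition as ROW_NUMBER() must.

```python
from datetime import date

# Hypothetical rows matching the WITH T example above.
rows = [
    {"id": 1, "x": "foo", "timestamp": date(2017, 4, 1)},
    {"id": 2, "x": "bar", "timestamp": date(2017, 4, 2)},
    {"id": 1, "x": "baz", "timestamp": date(2017, 4, 3)},
]

# Single pass: per id, remember only the row with the largest timestamp,
# mirroring ARRAY_AGG(t ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)].
# Nothing else is buffered, which is the memory advantage described above.
latest = {}
for row in rows:
    kept = latest.get(row["id"])
    if kept is None or row["timestamp"] > kept["timestamp"]:
        latest[row["id"]] = row

print(sorted(r["x"] for r in latest.values()))  # ['bar', 'baz']
```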


