大查询重复数据删除查询示例说明 [英] Big Query Deduplication query example explanation

查看:46
本文介绍了大查询重复数据删除查询示例说明的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

任何人都可以解释此Bigquery查询中的重复数据删除吗?为什么我们需要使用[OFFSET(0)]?我认为它是用来获取聚合数组中的第一个元素的吗?与LIMIT 1不一样吗?为什么我们需要汇总整个表?为什么我们可以在一个单元格中汇总整个表?

Anybody can explain this Bigquery query for deduplication? Why do we need to use [OFFSET(0)]? I think it is used to take the first element in aggregation array right? Isn't that the same as LIMIT 1? Why do we need to aggregation the entire table? Why can we aggregate an entire table in a single cell?

 # take the one name associated with a SKU
    WITH product_query AS (
      SELECT 
      DISTINCT 
      v2ProductName,
      productSKU
      FROM `data-to-insights.ecommerce.all_sessions_raw` 
      WHERE v2ProductName IS NOT NULL 
    )
    SELECT k.* FROM (
    # aggregate the products into an array and 
      # only take 1 result
      SELECT ARRAY_AGG(x LIMIT 1)[OFFSET(0)] k 
      FROM product_query x 
      GROUP BY productSKU # this is the field we want deduplicated
    );

推荐答案

让我们从我们要删除重复数据的一些数据开始:

Let's start with some data we want to de-duplicate:

WITH table AS (SELECT * FROM UNNEST([STRUCT('001' AS id, 1 AS a, 2 AS b), ('002', 3,5), ('001', 1, 4)]))

SELECT *
FROM table t

现在,我将使用 t 而不是 * 来引用整行:

Now, instead of *, I'm going to use t to refer to the whole row:

SELECT t
FROM table t

如果我将这些行按其ID分组会发生什么情况

What happens if I group each of these rows by their id:

SELECT t.id, ARRAY_AGG(t) tt
FROM table t
GROUP BY 1

现在,我将具有相同ID的所有行组合在一起.但我只能选择一个:

Now I have all the rows with the same id grouped together. But let me choose only one:

SELECT t.id, ARRAY_AGG(t LIMIT 1) tt
FROM table t
GROUP BY 1

这看起来不错,但是在一个数组中仍然是一行.如何只获取行而不获取数组:

That might look good, but that's still one row inside one array. How can I get only the row, and not an array:

SELECT t.id, ARRAY_AGG(t LIMIT 1)[OFFSET(0)] tt
FROM table t
GROUP BY 1

如果我想返回没有分组 id tt 前缀的行:

And if I want to get back a row without the grouping id, nor the tt prefix:

SELECT tt.*
FROM (
  SELECT t.id, ARRAY_AGG(t LIMIT 1)[OFFSET(0)] tt
  FROM table t
  GROUP BY 1
)

这就是您根据行ID对行进行重复删除的方法.

And that's how you de-duplicate rows based on the rows ids.

如果您需要选择特定的行-例如给定时间戳的最新行,只需像 ARRAY_AGG(t ORDER BY timestamp DESC LIMIT 1)那样对聚合进行排序

If you need to choose a particular row - for example the newest one given a timestamp, just order the aggregation like in ARRAY_AGG(t ORDER BY timestamp DESC LIMIT 1)

这篇关于大查询重复数据删除查询示例说明的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆