Merge Overlapping Intervals and Track Maximum Value in BigQuery SQL

Problem description

I am trying to solve a problem where I want to merge overlapping intervals for a given id, but I also want to track the maximum value for each merged interval. Each interval has a start_time and a stop_time, and each interval has a hierarchy/priority associated with it.

The table has the following columns: id, start_time, stop_time, some_value

Sample input:

Sample output:

Recommended answer

The query below is for BigQuery Standard SQL. I assume you are still working on the same use case as in your previous question, so I kept it in line with that solution; you can extend it if, for example, you also need to account for priorities.

So, anyway:

#standardSQL
WITH check_times AS (
  -- Every distinct interval boundary (start or stop) per id
  SELECT id, start_time AS TIME FROM `project.dataset.table` UNION DISTINCT
  SELECT id, stop_time AS TIME FROM `project.dataset.table`
), distinct_intervals AS (
  -- Elementary intervals between consecutive boundaries
  SELECT id, TIME AS start_time, LEAD(TIME) OVER(PARTITION BY id ORDER BY TIME) stop_time
  FROM check_times
), deduped_intervals AS (
  -- Maximum some_value among the source rows that fully cover each elementary interval;
  -- gaps covered by no row are dropped by the inner join
  SELECT a.id, a.start_time, a.stop_time, MAX(some_value) some_value
  FROM distinct_intervals a
  JOIN `project.dataset.table` b
  ON a.id = b.id
  AND a.start_time BETWEEN b.start_time AND b.stop_time
  AND a.stop_time BETWEEN b.start_time AND b.stop_time
  GROUP BY a.id, a.start_time, a.stop_time
), combined_intervals AS (
  -- Glue adjacent elementary intervals back together, keeping the overall maximum
  SELECT id, MIN(start_time) start_time, MAX(stop_time) stop_time, MAX(some_value) some_value
  FROM (
    -- grp is a running count of flags, so it increases at every gap
    SELECT id, start_time, stop_time, some_value, COUNTIF(flag) OVER(PARTITION BY id ORDER BY start_time) grp
    FROM (
      -- flag is TRUE when an interval does not start where the previous one stopped
      SELECT id, start_time, stop_time, some_value,
        start_time != IFNULL(LAG(stop_time) OVER(PARTITION BY id ORDER BY start_time), start_time) flag
      FROM deduped_intervals
    )
  )
  GROUP BY id, grp
)
SELECT *
FROM combined_intervals
-- ORDER BY id, start_time
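
To make the four steps of the query easier to follow, here is a rough plain-Python sketch of the same logic that can be run locally to cross-check results on a small extract. It is not part of the original answer: the function name `merge_with_max` and the sample rows are made up for illustration, and it assumes every interval has start_time < stop_time.

```python
def merge_with_max(rows):
    """rows: list of (id, start_time, stop_time, some_value) tuples (hypothetical layout)."""
    result = []
    for id_ in sorted({r[0] for r in rows}):
        events = [r for r in rows if r[0] == id_]

        # check_times: every distinct boundary (start or stop) for this id
        times = sorted({t for _, s, e, _ in events for t in (s, e)})

        # distinct_intervals + deduped_intervals: elementary pieces between
        # consecutive boundaries, with the max value of the events covering them
        pieces = []
        for lo, hi in zip(times, times[1:]):
            covering = [v for _, s, e, v in events if s <= lo and hi <= e]
            if covering:  # pieces covered by no event are gaps and are dropped
                pieces.append((lo, hi, max(covering)))

        # combined_intervals: glue pieces that start where the previous one stopped
        current = None
        for lo, hi, v in pieces:
            if current is not None and lo == current[1]:
                current = (current[0], hi, max(current[2], v))
            else:
                if current is not None:
                    result.append((id_, *current))
                current = (lo, hi, v)
        if current is not None:
            result.append((id_, *current))
    return result


# Made-up sample rows, not the asker's data
rows = [(1, 0, 10, 5), (1, 5, 20, 9), (1, 30, 40, 3), (2, 0, 4, 1)]
print(merge_with_max(rows))
# [(1, 0, 20, 9), (1, 30, 40, 3), (2, 0, 4, 1)]
```

As in the query, intervals that merely touch (one stops exactly where the next starts) end up in the same merged interval.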

Applied to your sample data, the result is:

Row id  start_time  stop_time   some_value   
1   1   0           36          50   
2   1   41          47          23    

Is it possible to add one more column to the result that shows the number of events during that time period?

#standardSQL
WITH check_times AS (
  SELECT id, start_time AS TIME FROM `project.dataset.table` UNION DISTINCT
  SELECT id, stop_time AS TIME FROM `project.dataset.table` 
), distinct_intervals AS (
  SELECT id, TIME AS start_time, LEAD(TIME) OVER(PARTITION BY id ORDER BY TIME) stop_time
  FROM check_times
), deduped_intervals AS (
  SELECT a.id, a.start_time, a.stop_time, MAX(some_value) some_value, ANY_VALUE(TO_JSON_STRING(b)) event_hash
  FROM distinct_intervals a
  JOIN `project.dataset.table` b
  ON a.id = b.id 
  AND a.start_time BETWEEN b.start_time AND b.stop_time 
  AND a.stop_time BETWEEN b.start_time AND b.stop_time
  GROUP BY a.id, a.start_time, a.stop_time
), combined_intervals AS (
  SELECT id, MIN(start_time) start_time, MAX(stop_time) stop_time, MAX(some_value) some_value, COUNT(DISTINCT event_hash) events
  FROM (
    SELECT *, COUNTIF(flag) OVER(PARTITION BY id ORDER BY start_time) grp
    FROM (
      SELECT *,
        start_time != IFNULL(LAG(stop_time) OVER(PARTITION BY id ORDER BY start_time), start_time) flag
      FROM deduped_intervals
    )
  )
  GROUP BY id, grp
)
SELECT *
FROM combined_intervals
-- ORDER BY id, start_time

with the result:

Row id  start_time  stop_time   some_value  events   
1   1   0           36          50          8    
2   1   41          47          23          1    
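
One thing worth checking on your own data: ANY_VALUE keeps a single, arbitrarily chosen source row per elementary interval, so COUNT(DISTINCT event_hash) may come out lower than the true number of events when several events cover the same piece. As a cross-check, here is a plain-Python sketch that counts every source row falling into each merged interval. It uses a different technique from the query (a sort-and-sweep merge rather than the boundary-splitting CTEs), and the function name `merge_with_max_and_count` and the sample rows are made up for illustration.

```python
def merge_with_max_and_count(rows):
    """rows: list of (id, start_time, stop_time, some_value) tuples (hypothetical layout).

    Returns (id, start_time, stop_time, max_value, events) per merged interval.
    """
    result = []
    for id_ in sorted({r[0] for r in rows}):
        # Sort this id's events by start_time so overlaps are met in order
        events = sorted((s, e, v) for i, s, e, v in rows if i == id_)
        current = None  # [start, stop, max_value, event_count]
        for s, e, v in events:
            if current is not None and s <= current[1]:
                # Overlaps or touches the running interval: extend it
                current[1] = max(current[1], e)
                current[2] = max(current[2], v)
                current[3] += 1
            else:
                # Gap: close the running interval and start a new one
                if current is not None:
                    result.append((id_, *current))
                current = [s, e, v, 1]
        if current is not None:
            result.append((id_, *current))
    return result


# Made-up sample rows, not the asker's data
rows = [(1, 0, 10, 5), (1, 5, 20, 9), (1, 30, 40, 3), (2, 0, 4, 1)]
print(merge_with_max_and_count(rows))
# [(1, 0, 20, 9, 2), (1, 30, 40, 3, 1), (2, 0, 4, 1, 1)]
```

Note that this sketch counts fully identical rows separately, whereas the DISTINCT in the query collapses them into one.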
