Merge Overlapping Intervals and Track Maximum Value in BigQuery SQL
Question
I am trying to merge overlapping intervals for a given id column, while also tracking the maximum value within each merged interval. Each interval has a start_time and a stop_time, and each interval has a hierarchy/priority associated with it.
The table has the following columns: id, start_time, stop_time, some_value
Sample input:
Sample output:
Recommended answer
Below is for BigQuery Standard SQL. I assume you are still working on the same use case as in your previous question, so I wanted to keep this in line with that solution. You can extend it if you also want to account for priorities, for example.
So, anyway:
#standardSQL
WITH check_times AS (
  -- Every distinct interval boundary (start or stop) per id
  SELECT id, start_time AS TIME FROM `project.dataset.table` UNION DISTINCT
  SELECT id, stop_time AS TIME FROM `project.dataset.table`
), distinct_intervals AS (
  -- Consecutive boundaries form elementary, non-overlapping sub-intervals
  SELECT id, TIME AS start_time, LEAD(TIME) OVER(PARTITION BY id ORDER BY TIME) stop_time
  FROM check_times
), deduped_intervals AS (
  -- Keep sub-intervals covered by at least one source row, with the max value over them
  SELECT a.id, a.start_time, a.stop_time, MAX(some_value) some_value
  FROM distinct_intervals a
  JOIN `project.dataset.table` b
  ON a.id = b.id
  AND a.start_time BETWEEN b.start_time AND b.stop_time
  AND a.stop_time BETWEEN b.start_time AND b.stop_time
  GROUP BY a.id, a.start_time, a.stop_time
), combined_intervals AS (
  -- Merge contiguous sub-intervals: flag marks a gap, grp numbers each contiguous run
  SELECT id, MIN(start_time) start_time, MAX(stop_time) stop_time, MAX(some_value) some_value
  FROM (
    SELECT id, start_time, stop_time, some_value, COUNTIF(flag) OVER(PARTITION BY id ORDER BY start_time) grp
    FROM (
      SELECT id, start_time, stop_time, some_value,
        start_time != IFNULL(LAG(stop_time) OVER(PARTITION BY id ORDER BY start_time), start_time) flag
      FROM deduped_intervals
    )
  )
  GROUP BY id, grp
)
SELECT *
FROM combined_intervals
-- ORDER BY id, start_time
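The grouping step inside combined_intervals is the least obvious part of the query, so here is a minimal sketch of it in isolation. The inline rows are hypothetical (they are not the question's sample data) and stand in for the output of deduped_intervals.

```sql
#standardSQL
-- Hypothetical, already non-overlapping sub-intervals (not the question's data).
WITH deduped_intervals AS (
  SELECT 1 AS id, 0 AS start_time, 10 AS stop_time, 5 AS some_value UNION ALL
  SELECT 1, 10, 20, 9 UNION ALL
  SELECT 1, 25, 30, 7
)
SELECT id, start_time, stop_time, some_value, flag,
  -- Running count of gaps seen so far: rows sharing a grp value
  -- belong to the same merged interval.
  COUNTIF(flag) OVER(PARTITION BY id ORDER BY start_time) AS grp
FROM (
  SELECT *,
    -- TRUE when this sub-interval does not start exactly where the previous
    -- one stopped; the first row compares with itself and yields FALSE.
    start_time != IFNULL(LAG(stop_time) OVER(PARTITION BY id ORDER BY start_time), start_time) AS flag
  FROM deduped_intervals
)
ORDER BY id, start_time
```

With these rows, the first two sub-intervals touch at 10 and share grp 0, while the third starts after a gap and gets grp 1. Grouping by id and grp then collapses them into the intervals 0 to 20 (maximum value 9) and 25 to 30 (value 7).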
Applied to your sample data, the full query above returns:
Row  id  start_time  stop_time  some_value
1    1   0           36         50
2    1   41          47         23
Is it possible to add one more column to the result that shows the number of events during that time period?
#standardSQL
WITH check_times AS (
  -- Every distinct interval boundary (start or stop) per id
  SELECT id, start_time AS TIME FROM `project.dataset.table` UNION DISTINCT
  SELECT id, stop_time AS TIME FROM `project.dataset.table`
), distinct_intervals AS (
  -- Consecutive boundaries form elementary, non-overlapping sub-intervals
  SELECT id, TIME AS start_time, LEAD(TIME) OVER(PARTITION BY id ORDER BY TIME) stop_time
  FROM check_times
), deduped_intervals AS (
  -- event_hash keeps one arbitrarily chosen source row (as JSON) per sub-interval
  SELECT a.id, a.start_time, a.stop_time, MAX(some_value) some_value, ANY_VALUE(TO_JSON_STRING(b)) event_hash
  FROM distinct_intervals a
  JOIN `project.dataset.table` b
  ON a.id = b.id
  AND a.start_time BETWEEN b.start_time AND b.stop_time
  AND a.stop_time BETWEEN b.start_time AND b.stop_time
  GROUP BY a.id, a.start_time, a.stop_time
), combined_intervals AS (
  -- Merge contiguous sub-intervals and count the distinct event hashes in each run
  SELECT id, MIN(start_time) start_time, MAX(stop_time) stop_time, MAX(some_value) some_value, COUNT(DISTINCT event_hash) events
  FROM (
    SELECT *, COUNTIF(flag) OVER(PARTITION BY id ORDER BY start_time) grp
    FROM (
      SELECT *,
        start_time != IFNULL(LAG(stop_time) OVER(PARTITION BY id ORDER BY start_time), start_time) flag
      FROM deduped_intervals
    )
  )
  GROUP BY id, grp
)
SELECT *
FROM combined_intervals
-- ORDER BY id, start_time
with the result:
Row  id  start_time  stop_time  some_value  events
1    1   0           36         50          8
2    1   41          47         23          1
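One caveat about the events column: ANY_VALUE keeps a single, arbitrarily chosen source row per sub-interval, so an event that is entirely covered by other events may never be picked and would then go uncounted. A possible alternative, offered only as a sketch, is to join the merged intervals back to the source rows and count those that fall inside each one. The inline rows and the source_events name below are hypothetical stand-ins for `project.dataset.table` and for the output of combined_intervals.

```sql
#standardSQL
-- Hypothetical events and their already-merged intervals (not the question's data).
WITH source_events AS (
  SELECT 1 AS id, 0 AS start_time, 10 AS stop_time, 5 AS some_value UNION ALL
  SELECT 1, 2, 8, 9 UNION ALL   -- lies entirely inside the first event
  SELECT 1, 25, 30, 7
), combined_intervals AS (
  SELECT 1 AS id, 0 AS start_time, 10 AS stop_time, 9 AS some_value UNION ALL
  SELECT 1, 25, 30, 7
)
SELECT c.id, c.start_time, c.stop_time, c.some_value,
  -- Counts every source row inside the merged interval, including exact duplicates.
  COUNT(*) AS events
FROM combined_intervals c
JOIN source_events e
ON c.id = e.id
AND e.start_time >= c.start_time
AND e.stop_time <= c.stop_time
GROUP BY c.id, c.start_time, c.stop_time, c.some_value
ORDER BY c.id, c.start_time
```

On these rows the interval 0 to 10 gets 2 events and the interval 25 to 30 gets 1. Unlike COUNT(DISTINCT event_hash), this version counts identical duplicate rows separately, so pick whichever matches what "number of events" should mean for your data.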