BigQuery中重叠间隔的计数 [英] Count of overlapping intervals in BigQuery
问题描述
给定一个间隔表,是否可以有效地查询每个间隔开始时当前打开的间隔的数量(包括当前间隔本身)?
Given a table of intervals, can I efficiently query for the number of currently open intervals at the start of each interval (including the current interval itself)?
例如,给定下表:
start_time end_time
1 10
2 5
3 4
5 6
7 11
19 20
我想要以下输出:
start_time count
1 1
2 2
3 3
5 3
7 2
19 1
在小型数据集上,我可以通过将数据集与其自身结合来解决此问题:
On small datasets, I can solve this problem by joining the dataset against itself:
WITH intervals AS (
SELECT 1 AS start, 10 AS end UNION ALL
SELECT 2, 5 UNION ALL
SELECT 3, 4 UNION ALL
SELECT 5, 6 UNION ALL
SELECT 7, 11 UNION ALL
SELECT 19, 20
)
SELECT
a.start_time,
count(*)
FROM
intervals a CROSS JOIN intervals b
WHERE
a.start_time >= b.start_time AND
a.start_time <= b.end_time
GROUP BY a.start_time
ORDER BY a.start_time
对于大型数据集,CROSS JOIN既不切实际又不必要,因为任何给定的答案仅取决于少量的先前间隔(按 start_time
排序时).实际上,在我拥有的数据集上,它超时.有没有更好的方法来实现这一目标?
With large datasets the CROSS JOIN is both impractical and unnecessary, since any given answer only depends on a small number of preceding intervals (when sorted by start_time
). In fact, on the dataset that I have, it times out. Is there a better way to achieve this?
推荐答案
... CROSS JOIN既不切实际又不必要......
有没有更好的方法来实现这一目标?
... CROSS JOIN is both impractical and unnecessary ...
Is there a better way to achieve this?
请尝试下面的BigQuery标准SQL.不涉及JOIN
Try below for BigQuery Standard SQL. No JOINs involved
#standardSQL
SELECT
start_time,
(SELECT COUNT(1) FROM UNNEST(ends) AS e WHERE e >= start_time) AS cnt
FROM (
SELECT
start_time,
ARRAY_AGG(end_time) OVER(ORDER BY start_time) AS ends
FROM intervals
)
-- ORDER BY start_time
您可以使用下面的示例对问题中的伪数据进行测试/播放
You can test/play with it using below example with dummy data from your question
#standardSQL
WITH intervals AS (
SELECT 1 AS start_time, 10 AS end_time UNION ALL
SELECT 2, 5 UNION ALL
SELECT 3, 4 UNION ALL
SELECT 5, 6 UNION ALL
SELECT 7, 11 UNION ALL
SELECT 19, 20
)
SELECT
start_time,
(SELECT COUNT(1) FROM UNNEST(ends) AS e WHERE e >= start_time) AS cnt
FROM (
SELECT
start_time,
ARRAY_AGG(end_time) OVER(ORDER BY start_time) AS ends
FROM intervals
)
-- ORDER BY start_time
这篇关于BigQuery中重叠间隔的计数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!