BigQuery中重叠间隔的计数 [英] Count of overlapping intervals in BigQuery

查看:41
本文介绍了BigQuery中重叠间隔的计数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

给定一个间隔表,是否可以有效地查询每个间隔开始时当前打开的间隔的数量(包括当前间隔本身)?

Given a table of intervals, can I efficiently query for the number of currently open intervals at the start of each interval (including the current interval itself)?

例如,给定下表:


start_time end_time
         1       10
         2        5
         3        4
         5        6
         7       11
        19       20

我想要以下输出:


start_time count
         1     1
         2     2
         3     3
         5     3
         7     2
        19     1

在小型数据集上,我可以通过将数据集与其自身结合来解决此问题:

On small datasets, I can solve this problem by joining the dataset against itself:

WITH intervals AS (
  SELECT 1 AS start, 10 AS end UNION ALL
  SELECT 2, 5 UNION ALL
  SELECT 3, 4 UNION ALL
  SELECT 5, 6 UNION ALL
  SELECT 7, 11 UNION ALL
  SELECT 19, 20
)
SELECT 
  a.start_time,
  count(*)
FROM 
  intervals a CROSS JOIN intervals b
WHERE
  a.start_time >= b.start_time AND
  a.start_time <= b.end_time
GROUP BY a.start_time
ORDER BY a.start_time

对于大型数据集,CROSS JOIN既不切实际又不必要,因为任何给定的答案仅取决于少量的先前间隔(按 start_time 排序时).实际上,在我拥有的数据集上,它超时.有没有更好的方法来实现这一目标?

With large datasets the CROSS JOIN is both impractical and unnecessary, since any given answer only depends on a small number of preceding intervals (when sorted by start_time). In fact, on the dataset that I have, it times out. Is there a better way to achieve this?

推荐答案

... CROSS JOIN既不切实际又不必要......
有没有更好的方法来实现这一目标?

... CROSS JOIN is both impractical and unnecessary ...
Is there a better way to achieve this?

请尝试下面的BigQuery标准SQL.不涉及JOIN

Try below for BigQuery Standard SQL. No JOINs involved

#standardSQL
SELECT 
  start_time,
  (SELECT COUNT(1) FROM UNNEST(ends) AS e WHERE e >= start_time) AS cnt  
FROM (
  SELECT 
    start_time, 
    ARRAY_AGG(end_time) OVER(ORDER BY start_time) AS ends
  FROM intervals
)
-- ORDER BY start_time  

您可以使用下面的示例对问题中的伪数据进行测试/播放

You can test/play with it using below example with dummy data from your question

#standardSQL
WITH intervals AS (
  SELECT 1 AS start_time, 10 AS end_time UNION ALL
  SELECT 2, 5 UNION ALL
  SELECT 3, 4 UNION ALL
  SELECT 5, 6 UNION ALL
  SELECT 7, 11 UNION ALL
  SELECT 19, 20 
)
SELECT 
  start_time,
  (SELECT COUNT(1) FROM UNNEST(ends) AS e WHERE e >= start_time) AS cnt  
FROM (
  SELECT 
    start_time, 
    ARRAY_AGG(end_time) OVER(ORDER BY start_time) AS ends
  FROM intervals
)
-- ORDER BY start_time

这篇关于BigQuery中重叠间隔的计数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆