BigQuery: how to perform a rolling timestamp window group count that produces a row for each day


Question

This is an extension of a question that I asked and resolved on StackOverflow.

I'm a BigQuery and SQL novice and I wanted to construct a Standard SQL query that would group and count events over a rolling time window of X days. My data table looks like this:

event_id |    url    |          timestamp   
-----------------------------------------------------------
xx         a.html      2016-10-18 15:55:16 UTC
xx         a.html      2016-10-19 16:68:55 UTC
xx         a.html      2016-10-25 20:55:57 UTC
yy         b.html      2016-10-18 15:58:09 UTC
yy         a.html      2016-10-18 08:32:43 UTC
zz         a.html      2016-10-20 04:44:22 UTC
zz         c.html      2016-10-21 02:12:34 UTC

I'm tracking events that occur on urls. I want to know how many times each event occurred on each url during a rolling time period of X days. When I asked this question, I got a great answer:

WITH dailyAggregations AS (
  SELECT 
    DATE(ts) AS day, 
    url, 
    event_id, 
    UNIX_SECONDS(TIMESTAMP(DATE(ts))) AS sec, 
    COUNT(1) AS events 
  FROM yourTable
  GROUP BY day, url, event_id, sec
)
SELECT 
  url, event_id, day, events, 
  SUM(events) 
    OVER(PARTITION BY url, event_id ORDER BY sec 
      RANGE BETWEEN 259200 PRECEDING AND CURRENT ROW
  ) AS rolling4daysEvents
FROM dailyAggregations

Here, 259200 is 3 days in seconds (3 x 24 x 3600). As I understand it, this query creates an intermediate table that groups and counts events by day. It also converts each day (the timestamp truncated to midnight) into its Unix-seconds equivalent. Then it sums the events over a window that is measured in seconds.
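To make the window arithmetic concrete, here is a minimal sketch that runs the same window over the three xx / a.html rows from the sample table, entered inline as daily counts. The inline rows stand in for the dailyAggregations CTE and are only an illustration; the window clause itself is copied from the query above.

```sql
-- Daily counts for event xx on a.html, taken from the sample table.
WITH dailyAggregations AS (
  SELECT DATE '2016-10-18' AS day, 1 AS events UNION ALL
  SELECT DATE '2016-10-19', 1 UNION ALL
  SELECT DATE '2016-10-25', 1
)
SELECT
  day,
  events,
  -- The frame spans the 3 preceding days plus the current day (4 calendar days).
  SUM(events) OVER (
    ORDER BY UNIX_SECONDS(TIMESTAMP(day))
    RANGE BETWEEN 259200 PRECEDING AND CURRENT ROW
  ) AS rolling4daysEvents
FROM dailyAggregations
ORDER BY day
-- day        | events | rolling4daysEvents
-- 2016-10-18 | 1      | 1
-- 2016-10-19 | 1      | 2
-- 2016-10-25 | 1      | 1
```

The totals are right for the days that appear, but there are no rows at all for 2016-10-20 through 2016-10-24, which is the gap the rest of the question is about.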

Now, this produces a table with correct running totals, but it does not guarantee a row for every date, url, and event. In other words, dates will be missing from the resulting table whenever a given event never occurred on a given url on those dates. Bottom line: can I modify the above query (or construct a different one) so that it correctly produces rolling4daysEvents values for each date in an interval? For example, an interval defined as:

SELECT *
  FROM UNNEST (GENERATE_DATE_ARRAY('2016-08-28', '2016-11-06')) AS day
  ORDER BY day ASC

Thanks!

Recommended answer

WITH dailyAggregations AS (
  SELECT 
    DATE(ts) AS day, 
    url, 
    event_id, 
    UNIX_SECONDS(TIMESTAMP(DATE(ts))) AS sec, 
    COUNT(1) AS events 
  FROM yourTable
  GROUP BY day, url, event_id, sec
),
calendar AS (
  SELECT day
  FROM UNNEST (GENERATE_DATE_ARRAY('2016-08-28', '2016-11-06')) AS day
)
SELECT 
  c.day, url, event_id, events, 
  SUM(events) 
    OVER(PARTITION BY url, event_id ORDER BY sec 
      RANGE BETWEEN 259200 PRECEDING AND CURRENT ROW
  ) AS rolling4daysEvents
FROM calendar AS c
LEFT JOIN dailyAggregations AS a
ON a.day = c.day
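
One thing to watch with the query above: on calendar days that have no events, the LEFT JOIN leaves url, event_id, and sec as NULL, so those rows are not attached to any particular url and event. A possible variation, offered only as a sketch and not as part of the original answer, is to cross join the calendar with every (url, event_id) pair before joining the daily counts. The pairs CTE, the IFNULL defaults, and the day-based window using UNIX_DATE are assumptions introduced here; yourTable and ts are taken from the query above.

```sql
WITH dailyAggregations AS (
  SELECT
    DATE(ts) AS day,
    url,
    event_id,
    COUNT(1) AS events
  FROM yourTable
  GROUP BY day, url, event_id
),
calendar AS (
  SELECT day
  FROM UNNEST(GENERATE_DATE_ARRAY('2016-08-28', '2016-11-06')) AS day
),
pairs AS (
  -- Hypothetical helper: every (url, event_id) combination seen at least once.
  SELECT DISTINCT url, event_id
  FROM dailyAggregations
)
SELECT
  c.day,
  p.url,
  p.event_id,
  IFNULL(a.events, 0) AS events,  -- 0 on days with no matching events
  SUM(IFNULL(a.events, 0)) OVER (
    PARTITION BY p.url, p.event_id
    ORDER BY UNIX_DATE(c.day)     -- days since 1970-01-01
    RANGE BETWEEN 3 PRECEDING AND CURRENT ROW
  ) AS rolling4daysEvents
FROM calendar AS c
CROSS JOIN pairs AS p
LEFT JOIN dailyAggregations AS a
  ON a.day = c.day
  AND a.url = p.url
  AND a.event_id = p.event_id
ORDER BY p.url, p.event_id, c.day
```

Ordering by UNIX_DATE lets the frame be written in days (3 PRECEDING) instead of 259200 seconds. Events dated before the first calendar day are not joined in, so the first three days of the interval only count events from inside the interval.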
