BigQuery: how to perform a rolling timestamp window group count that produces a row for each day


Question

This is an extension of a question that I asked and resolved on StackOverflow here.

I'm a BigQuery and SQL novice, and I want to construct a Standard SQL query that groups and counts events over a rolling time window of X days. My data table looks like this:

event_id | url    | timestamp
---------+--------+------------------------
xx       | a.html | 2016-10-18 15:55:16 UTC
xx       | a.html | 2016-10-19 16:68:55 UTC
xx       | a.html | 2016-10-25 20:55:57 UTC
yy       | b.html | 2016-10-18 15:58:09 UTC
yy       | a.html | 2016-10-18 08:32:43 UTC
zz       | a.html | 2016-10-20 04:44:22 UTC
zz       | c.html | 2016-10-21 02:12:34 UTC

I'm tracking events that occur on URLs. I want to know how many times each event occurred on each URL during a rolling time period of X days. When I asked this question, I got a great answer:

WITH dailyAggregations AS (
  SELECT 
    DATE(ts) AS day, 
    url, 
    event_id, 
    UNIX_SECONDS(TIMESTAMP(DATE(ts))) AS sec, 
    COUNT(1) AS events 
  FROM yourTable
  GROUP BY day, url, event_id, sec
)
SELECT 
  url, event_id, day, events, 
  SUM(events) 
    OVER(PARTITION BY url, event_id ORDER BY sec 
      RANGE BETWEEN 259200 PRECEDING AND CURRENT ROW
  ) AS rolling4daysEvents
FROM dailyAggregations

where 259200 is 3 days in seconds (3 x 24 x 3600). As I understand it, this query creates an intermediate table that groups and counts events by day. It also converts each day (truncated to midnight UTC) into its Unix-seconds equivalent. Then it sums the events using a window that is measured in seconds.
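To make the window arithmetic concrete, here is a minimal sketch of the same query with the frame expressed in days rather than seconds. It assumes the same placeholder table `yourTable` with a TIMESTAMP column `ts` as the query above. `UNIX_DATE` returns the number of days since 1970-01-01, so a frame of 3 PRECEDING through CURRENT ROW spans four calendar days, which is what 259200 PRECEDING does when the ordering key is in seconds.

```sql
WITH dailyAggregations AS (
  -- One row per (day, url, event_id) with the number of events on that day.
  SELECT
    DATE(ts) AS day,
    url,
    event_id,
    COUNT(1) AS events
  FROM yourTable
  GROUP BY day, url, event_id
)
SELECT
  url, event_id, day, events,
  SUM(events) OVER (
    PARTITION BY url, event_id
    ORDER BY UNIX_DATE(day)                     -- integer day number
    RANGE BETWEEN 3 PRECEDING AND CURRENT ROW   -- current day plus the 3 days before it
  ) AS rolling4daysEvents
FROM dailyAggregations
```

Because the frame is a RANGE over the day number rather than a count of rows, days with no events are simply skipped in the sum, but they also produce no output row.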

Now, this produces a table with correct running totals, but it does not guarantee a row for every date, url, and event. In other words, dates will be missing from the resulting table whenever a given event never occurred on a given url on that date. Bottom line: can I modify the above query (or construct a different one) so that it correctly produces values for rolling4daysEvents for each date in an interval? For example, an interval defined as:

SELECT *
  FROM UNNEST (GENERATE_DATE_ARRAY('2016-08-28', '2016-11-06')) AS day
  ORDER BY day ASC

Thanks!

Recommended answer

WITH dailyAggregations AS (
  SELECT 
    DATE(ts) AS day, 
    url, 
    event_id, 
    UNIX_SECONDS(TIMESTAMP(DATE(ts))) AS sec, 
    COUNT(1) AS events 
  FROM yourTable
  GROUP BY day, url, event_id, sec
),
calendar AS (
  SELECT day
  FROM UNNEST (GENERATE_DATE_ARRAY('2016-08-28', '2016-11-06')) AS day
)
SELECT 
  c.day, url, event_id, events, 
  SUM(events) 
    OVER(PARTITION BY url, event_id ORDER BY sec 
      RANGE BETWEEN 259200 PRECEDING AND CURRENT ROW
  ) AS rolling4daysEvents
FROM calendar AS c
LEFT JOIN dailyAggregations AS a
ON a.day = c.day
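
Note that the join above returns at least one row per calendar day, but on a day with no events the url and event_id columns are NULL, and a (url, event_id) pair that is absent on a given day still gets no row for that day. One possible way to get a row per day for every pair, sketched here under the same assumptions (placeholder table `yourTable` with a TIMESTAMP column `ts`; the CTE names `pairs` and `grid` are introduced only for this illustration), is to cross join the calendar with the distinct pairs, left join the daily counts, and treat missing days as zero.

```sql
WITH dailyAggregations AS (
  SELECT
    DATE(ts) AS day,
    url,
    event_id,
    COUNT(1) AS events
  FROM yourTable
  GROUP BY day, url, event_id
),
calendar AS (
  SELECT day
  FROM UNNEST(GENERATE_DATE_ARRAY('2016-08-28', '2016-11-06')) AS day
),
-- Every (url, event_id) combination that appears at least once.
pairs AS (
  SELECT DISTINCT url, event_id
  FROM dailyAggregations
),
-- One row per calendar day for every pair; days without events count as 0.
grid AS (
  SELECT
    c.day,
    p.url,
    p.event_id,
    IFNULL(a.events, 0) AS events
  FROM calendar AS c
  CROSS JOIN pairs AS p
  LEFT JOIN dailyAggregations AS a
    ON a.day = c.day AND a.url = p.url AND a.event_id = p.event_id
)
SELECT
  day, url, event_id, events,
  SUM(events) OVER (
    PARTITION BY url, event_id
    ORDER BY UNIX_DATE(day)                     -- days since 1970-01-01
    RANGE BETWEEN 3 PRECEDING AND CURRENT ROW   -- current day plus the 3 days before it
  ) AS rolling4daysEvents
FROM grid
ORDER BY url, event_id, day
```

Two caveats with this sketch: events dated before the first calendar day are not counted, so the first three days of the interval may be understated, and the result size grows as the number of days times the number of distinct pairs.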
