BigQuery: how to perform a rolling timestamp window group count that produces a row for each day
Question
This is an extension of a question that I asked and resolved on StackOverflow here.
I'm a BigQuery and SQL novice, and I want to construct a Standard SQL query that groups and counts events over a rolling time window of X days. My data table looks like this:
event_id | url    | timestamp
---------|--------|------------------------
xx       | a.html | 2016-10-18 15:55:16 UTC
xx       | a.html | 2016-10-19 16:68:55 UTC
xx       | a.html | 2016-10-25 20:55:57 UTC
yy       | b.html | 2016-10-18 15:58:09 UTC
yy       | a.html | 2016-10-18 08:32:43 UTC
zz       | a.html | 2016-10-20 04:44:22 UTC
zz       | c.html | 2016-10-21 02:12:34 UTC
I'm tracking events that occur on URLs. I want to know how many times each event occurred on each URL during a rolling time period of X days. When I asked this question, I got a great answer:
WITH dailyAggregations AS (
  SELECT
    DATE(ts) AS day,
    url,
    event_id,
    UNIX_SECONDS(TIMESTAMP(DATE(ts))) AS sec,
    COUNT(1) AS events
  FROM yourTable
  GROUP BY day, url, event_id, sec
)
SELECT
  url, event_id, day, events,
  SUM(events)
    OVER(PARTITION BY url, event_id ORDER BY sec
      RANGE BETWEEN 259200 PRECEDING AND CURRENT ROW
    ) AS rolling4daysEvents
FROM dailyAggregations
where 259200 is 3 days in seconds (3 x 24 x 3600). As I understand it, this query creates an intermediate table that groups and counts events by day. It also converts each day into its Unix-seconds equivalent. It then sums the events over a window that is measured in seconds.
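Because `sec` is always midnight UTC of `day`, it is an exact multiple of 86400, so the same window can also be written in whole days. The variant below is a minimal sketch of that idea and is not part of the answer I received: it orders by `UNIX_DATE(day)` and uses `RANGE BETWEEN 3 PRECEDING AND CURRENT ROW`, which covers the current day plus the three days before it (hence "4 days" in the column name). It assumes the same `yourTable` with a TIMESTAMP column `ts` as in the query above.

```sql
-- Sketch: same rolling window as above, measured in days instead of seconds.
-- Assumes yourTable(event_id, url, ts TIMESTAMP), as in the query above.
WITH dailyAggregations AS (
  SELECT
    DATE(ts) AS day,
    url,
    event_id,
    COUNT(1) AS events
  FROM yourTable
  GROUP BY day, url, event_id
)
SELECT
  url, event_id, day, events,
  SUM(events) OVER (
    PARTITION BY url, event_id
    ORDER BY UNIX_DATE(day)  -- whole days since 1970-01-01
    RANGE BETWEEN 3 PRECEDING AND CURRENT ROW
  ) AS rolling4daysEvents
FROM dailyAggregations
```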
This produces a table with correct running totals, but it does not guarantee a row for every date, URL, and event. In other words, a date will be missing from the result whenever a given event did not occur on a given URL on that date. Bottom line: can I modify the above query (or construct a different one) so that it correctly produces values for rolling4daysEvents for each date in an interval, e.g. an interval defined as:
SELECT *
FROM UNNEST (GENERATE_DATE_ARRAY('2016-08-28', '2016-11-06')) AS day
ORDER BY day ASC
Thanks!
Recommended answer
WITH dailyAggregations AS (
  SELECT
    DATE(ts) AS day,
    url,
    event_id,
    UNIX_SECONDS(TIMESTAMP(DATE(ts))) AS sec,
    COUNT(1) AS events
  FROM yourTable
  GROUP BY day, url, event_id, sec
),
calendar AS (
  SELECT day
  FROM UNNEST (GENERATE_DATE_ARRAY('2016-08-28', '2016-11-06')) AS day
)
SELECT
  c.day, url, event_id, events,
  SUM(events)
    OVER(PARTITION BY url, event_id ORDER BY sec
      RANGE BETWEEN 259200 PRECEDING AND CURRENT ROW
    ) AS rolling4daysEvents
FROM calendar AS c
LEFT JOIN dailyAggregations AS a
ON a.day = c.day
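Note that joining on `day` alone adds a row only for calendar days on which nothing happened at all, and those rows carry NULL `url`, `event_id`, and `sec`, so they fall into a separate window partition. If the goal is one row per day for every (url, event_id) pair, one possible approach is to build the full day-by-pair grid first and left-join the daily counts onto it. The following is a sketch under that assumption, not the answerer's query; the CTE names `pairs` and `grid` are made up for illustration, while `yourTable` and its `ts` column are taken from the queries above.

```sql
-- Sketch only: pairs and grid are hypothetical helper CTEs.
WITH dailyAggregations AS (
  SELECT
    DATE(ts) AS day,
    url,
    event_id,
    COUNT(1) AS events
  FROM yourTable
  GROUP BY day, url, event_id
),
calendar AS (
  SELECT day
  FROM UNNEST(GENERATE_DATE_ARRAY('2016-08-28', '2016-11-06')) AS day
),
pairs AS (
  -- every (url, event_id) combination that occurs at least once
  SELECT DISTINCT url, event_id
  FROM dailyAggregations
),
grid AS (
  -- one row per calendar day for each combination
  SELECT c.day, p.url, p.event_id
  FROM calendar AS c
  CROSS JOIN pairs AS p
)
SELECT
  g.day,
  g.url,
  g.event_id,
  IFNULL(a.events, 0) AS events,  -- 0 on days without events
  SUM(IFNULL(a.events, 0)) OVER (
    PARTITION BY g.url, g.event_id
    ORDER BY UNIX_SECONDS(TIMESTAMP(g.day))
    RANGE BETWEEN 259200 PRECEDING AND CURRENT ROW
  ) AS rolling4daysEvents
FROM grid AS g
LEFT JOIN dailyAggregations AS a
  ON a.day = g.day
  AND a.url = g.url
  AND a.event_id = g.event_id
ORDER BY g.url, g.event_id, g.day
```

Days before the start of the calendar are not in the grid, so the rolling sums for the first three days of the range only include events from the calendar start onward.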