BigQuery: how to group and count rows within a rolling timestamp window?
Question
I have some experience with MongoDB and I'm learning BigQuery. I'm trying to perform the following task, but I don't know how to do it using BigQuery's standard SQL.
I have a table with the following data. It contains events that occur on different website urls. The timestamp records when the given event occurred. For example, the first row means "event 'xx' occurred on url 'a.html' at 2016-10-18 15:55:16 UTC."
event_id | url    | timestamp
---------+--------+------------------------
xx       | a.html | 2016-10-18 15:55:16 UTC
xx       | a.html | 2016-10-19 16:68:55 UTC
xx       | a.html | 2016-10-25 20:55:57 UTC
yy       | b.html | 2016-10-18 15:58:09 UTC
yy       | a.html | 2016-10-18 08:32:43 UTC
zz       | a.html | 2016-10-20 04:44:22 UTC
zz       | c.html | 2016-10-21 02:12:34 UTC
I want to count how many times each event occurred on each url over a rolling 3-day window. In other words, I want to be able to say the following:
"On the url 'a.html', during the interval [2016-10-18 00:00:00 UTC, 2016-10-21 00:00:00 UTC), event 'xx' occurred twice."
"On the url 'a.html', during the interval [2016-10-19 00:00:00 UTC, 2016-10-22 00:00:00 UTC), event 'xx' occurred once."
"On the url 'a.html', during the interval [2016-10-20 00:00:00 UTC, 2016-10-23 00:00:00 UTC), event 'xx' occurred zero times." (Note: this does not need to be returned as a row. The absence of this row can imply that the event occurred zero times.)
Some notes: my database contains over 100k rows per day, and the frequency of events varies widely. For example, in one day event 'xx' will occur ~10,000 times while event 'zz' will occur ~0-2 times.
Given my limited SQL knowledge, I didn't want to prescribe a structure for the resulting table, because I figured that might needlessly limit the possible answers. Thanks!
Recommended answer
The query below is for BigQuery Standard SQL (see Enabling Standard SQL).
I am using ts as the field name (instead of timestamp, as in your example) and assume this field is of the TIMESTAMP data type.
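-- Step 1: collapse the raw events into one row per (day, url, event_id).
WITH dailyAggregations AS (
  SELECT
    DATE(ts) AS day,
    url,
    event_id,
    -- Start of the day as Unix seconds, used as the numeric ORDER BY key below.
    UNIX_SECONDS(TIMESTAMP(DATE(ts))) AS sec,
    COUNT(1) AS events
  FROM yourTable
  GROUP BY day, url, event_id, sec
)
-- Step 2: sum the daily counts over a trailing range measured in seconds.
SELECT
  url, event_id, day, events,
  SUM(events) OVER(
    PARTITION BY url, event_id
    ORDER BY sec
    RANGE BETWEEN 259200 PRECEDING AND CURRENT ROW
  ) AS rolling3daysEvents
FROM dailyAggregations
-- ORDER BY url, event_id, day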
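The value 259200 is 3 x 24 x 3600, i.e. three days expressed in seconds, so you can set whatever rolling period you need. Note that both RANGE bounds are inclusive: with one row per day, this frame covers the current day plus the three preceding days. If the window should span exactly three calendar days, use 172800 (2 x 24 x 3600) instead.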
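To see which days a given frame picks up, here is a minimal Python sketch that emulates the two steps of the query (daily aggregation, then a trailing sum over a range of seconds). It is an illustration of the windowing logic, not BigQuery code. The helper names `day_start_seconds` and `rolling_counts` are made up for this example, only the date part of each sample row is used, and the last row (event 'xx' on 2016-10-21) is a hypothetical addition that is not in the question's data; it is there so that a day exactly three days after another one exists.

```python
from collections import Counter
from datetime import date

EPOCH = date(1970, 1, 1)

# Sample rows from the question, reduced to (event_id, url, day).
rows = [
    ("xx", "a.html", date(2016, 10, 18)),
    ("xx", "a.html", date(2016, 10, 19)),
    ("xx", "a.html", date(2016, 10, 25)),
    ("yy", "b.html", date(2016, 10, 18)),
    ("yy", "a.html", date(2016, 10, 18)),
    ("zz", "a.html", date(2016, 10, 20)),
    ("zz", "c.html", date(2016, 10, 21)),
    # Hypothetical extra row (not in the question) to exercise the frame edge.
    ("xx", "a.html", date(2016, 10, 21)),
]


def day_start_seconds(day):
    """Start of the day as Unix seconds, like UNIX_SECONDS(TIMESTAMP(DATE(ts)))."""
    return (day - EPOCH).days * 86400


def rolling_counts(rows, preceding_seconds):
    """Emulate RANGE BETWEEN preceding_seconds PRECEDING AND CURRENT ROW."""
    # Step 1: one count per (url, event_id, day), like the dailyAggregations CTE.
    daily = Counter((url, event_id, day) for event_id, url, day in rows)
    # Step 2: for each daily row, sum the rows of the same partition whose key
    # lies within [sec - preceding_seconds, sec]; both bounds are inclusive.
    rolled = {}
    for (url, event_id, day) in daily:
        sec = day_start_seconds(day)
        rolled[(url, event_id, day)] = sum(
            count
            for (u, e, d), count in daily.items()
            if (u, e) == (url, event_id)
            and 0 <= sec - day_start_seconds(d) <= preceding_seconds
        )
    return rolled


three = rolling_counts(rows, 259200)  # 3 * 24 * 3600
two = rolling_counts(rows, 172800)    # 2 * 24 * 3600

key = ("a.html", "xx", date(2016, 10, 21))
print(three[key])  # 3: counts 10-18, 10-19 and 10-21 (four calendar days)
print(two[key])    # 2: counts 10-19 and 10-21 (three calendar days)
```

With 259200 the row for 2016-10-21 still includes the event from 2016-10-18, because that day starts exactly 259200 seconds earlier and the lower bound is inclusive.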