BigQuery: How to group and count rows within a rolling timestamp window?


Question

I have some experience with MongoDB and I'm learning about BigQuery. I'm trying to perform the following task, and I don't know how to do it using BigQuery's standard SQL.

I have a table with the following data. It contains events that occur on different website urls. The timestamp represents when the given event occurred. For example, the first row means, "event 'xx' occurred on url 'a.html' at 2016-10-18 15:55:16 UTC."

event_id |    url    |          timestamp   
-----------------------------------------------------------
   xx         a.html      2016-10-18 15:55:16 UTC
   xx         a.html      2016-10-19 16:68:55 UTC
   xx         a.html      2016-10-25 20:55:57 UTC
   yy         b.html      2016-10-18 15:58:09 UTC
   yy         a.html      2016-10-18 08:32:43 UTC
   zz         a.html      2016-10-20 04:44:22 UTC
   zz         c.html      2016-10-21 02:12:34 UTC

I want to count the number of times each event occurred on each url over a rolling 3-day window. In other words, I want to be able to say the following:

  • "On the url 'a.html', during the interval [2016-10-18 00:00:00 UTC, 2016-10-21 00:00:00 UTC), event 'xx' occurred twice."

  • "On the url 'a.html', during the interval [2016-10-19 00:00:00 UTC, 2016-10-22 00:00:00 UTC), event 'xx' occurred once."

  • "On the url 'a.html', during the interval [2016-10-20 00:00:00 UTC, 2016-10-23 00:00:00 UTC), event 'xx' occurred zero times." (Note: this does not need to be returned as a row. The absence of this row can imply that the event occurred zero times.)

Some notes: my database contains over 100k rows per day, and the frequency of events varies widely. For example, in one day event 'xx' will occur ~10,000 times while event 'zz' will occur ~0-2 times.

Given my limited SQL knowledge, I didn't want to provide a structure for the resulting table, because I figured that might incorrectly limit possible answers. Thanks!

Recommended answer

The query below is for BigQuery Standard SQL (see Enabling Standard SQL).

I am using ts as the field name (instead of timestamp, as it is in your example) and assume this field is of the TIMESTAMP data type.

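-- Step 1: collapse the raw events into one row per (day, url, event_id).
WITH dailyAggregations AS (
  SELECT 
    DATE(ts) AS day, 
    url, 
    event_id, 
    -- Midnight (UTC) of the event's day as Unix seconds; used as the numeric
    -- ORDER BY key that the RANGE frame below requires.
    UNIX_SECONDS(TIMESTAMP(DATE(ts))) AS sec, 
    COUNT(1) AS events 
  FROM yourTable
  GROUP BY day, url, event_id, sec
)
-- Step 2: for each daily row, sum the daily counts that fall inside the frame.
SELECT 
  url, event_id, day, events, 
  SUM(events) 
    OVER(PARTITION BY url, event_id ORDER BY sec 
      -- 259200 seconds = 3 * 24 * 3600
      RANGE BETWEEN 259200 PRECEDING AND CURRENT ROW
  ) AS rolling3daysEvents
FROM dailyAggregations
-- ORDER BY url, event_id, day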

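The value 259200 is 3 x 24 x 3600, i.e. three days expressed in seconds, so you can set whatever rolling period you need. Note that both RANGE bounds are inclusive: because sec is always midnight of the event's day, this frame covers the current day plus the three preceding days. If the window should span exactly three calendar days, use 172800 (2 x 24 x 3600) instead.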
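To check which days a given frame size actually pulls in, the query's logic can be emulated outside BigQuery. The following is only a minimal Python sketch, not part of the original answer: it uses the question's sample rows reduced to calendar dates, and the names rolling_counts and four_days are hypothetical helpers introduced for illustration. It assumes the frame behaves like an inclusive range over day-aligned values, as in RANGE BETWEEN n PRECEDING AND CURRENT ROW.

```python
from collections import Counter
from datetime import date, timedelta

# Sample rows from the question, reduced to (event_id, url, day).
rows = [
    ("xx", "a.html", date(2016, 10, 18)),
    ("xx", "a.html", date(2016, 10, 19)),
    ("xx", "a.html", date(2016, 10, 25)),
    ("yy", "b.html", date(2016, 10, 18)),
    ("yy", "a.html", date(2016, 10, 18)),
    ("zz", "a.html", date(2016, 10, 20)),
    ("zz", "c.html", date(2016, 10, 21)),
]


def rolling_counts(rows, preceding_days):
    """Emulate the daily aggregation followed by an inclusive trailing frame.

    Returns {(url, event_id, day): rolling_sum}, where rolling_sum adds the
    daily counts for the same url and event_id on every day d with
    day - preceding_days <= d <= day.
    """
    # Step 1: one count per (url, event_id, day), like dailyAggregations.
    daily = Counter((url, event_id, day) for event_id, url, day in rows)

    # Step 2: sum the daily counts inside the inclusive frame.
    result = {}
    for url, event_id, day in daily:
        start = day - timedelta(days=preceding_days)
        result[(url, event_id, day)] = sum(
            n
            for (u, e, d), n in daily.items()
            if u == url and e == event_id and start <= d <= day
        )
    return result


counts = rolling_counts(rows, preceding_days=3)
for key in sorted(counts):
    print(key, counts[key])

# Hypothetical extra data: one event per day on four consecutive days.
four_days = [("xx", "a.html", date(2016, 10, d)) for d in (18, 19, 20, 21)]
last_day = ("a.html", "xx", date(2016, 10, 21))
print(rolling_counts(four_days, 3)[last_day])  # frame spans four calendar days
print(rolling_counts(four_days, 2)[last_day])  # frame spans three calendar days
```

On the hypothetical four_days data, a 3-day PRECEDING bound sums four daily rows and a 2-day bound sums three, which is the difference to keep in mind when choosing the frame size in seconds.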
