BigQuery: how to group and count rows within a rolling timestamp window?


Question


I have some experience with MongoDB and I'm learning about BigQuery. I'm trying to perform the following task, and I don't know how to do it using BigQuery's standard SQL.

I have a table with the following data. It contains events that occur on different website urls. Timestamp represents when the given event occurred. For example, the first row means, "event 'xx' occurred on url 'a.html' at 2016-10-18 15:55:16 UTC."

event_id |    url    |          timestamp   
-----------------------------------------------------------
   xx         a.html      2016-10-18 15:55:16 UTC
   xx         a.html      2016-10-19 16:68:55 UTC
   xx         a.html      2016-10-25 20:55:57 UTC
   yy         b.html      2016-10-18 15:58:09 UTC
   yy         a.html      2016-10-18 08:32:43 UTC
   zz         a.html      2016-10-20 04:44:22 UTC
   zz         c.html      2016-10-21 02:12:34 UTC

I want to count the number of times each event occurred on each url over a rolling 3-day window. In other words, I want to be able to say the following:

  • "on the url 'a.html', during the interval [2016-10-18 00:00:00 UTC, 2016-10-21 00:00:00 UTC), event 'xx' occurred twice."

  • "on the url 'a.html', during the interval [2016-10-19 00:00:00 UTC, 2016-10-22 00:00:00 UTC), event 'xx' occurred once."

  • "on the url 'a.html', during the interval [2016-10-20 00:00:00 UTC, 2016-10-23 00:00:00 UTC), event 'xx' occurred zero times." (NOTE: THIS DOES NOT NEED TO BE RETURNED AS A ROW. The absence of this row can imply that the event occurred zero times.)

Some notes: my database contains over 100k rows per day, and the occurrence of events varies. Meaning, in 1 day, event 'xx' will occur ~10,000 times and event 'zz' will occur ~0-2 times.

Given my limited SQL knowledge, I didn't want to provide structure for the resulting table, because I figured that might incorrectly limit possible answers. Thanks!

Solution

The query below is for BigQuery Standard SQL (see Enabling Standard SQL).

I am using ts as the field name (instead of timestamp, as in your example) and assume this field is of the TIMESTAMP data type.

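WITH dailyAggregations AS (
  -- One row per (day, url, event_id) with that day's event count
  SELECT
    DATE(ts) AS day,
    url,
    event_id,
    UNIX_SECONDS(TIMESTAMP(DATE(ts))) AS sec,  -- midnight UTC of the day, in seconds
    COUNT(1) AS events
  FROM yourTable
  GROUP BY day, url, event_id, sec
)
SELECT
  url, event_id, day, events,
  -- Sum the daily counts over the current day and the two days before it
  SUM(events) OVER (
    PARTITION BY url, event_id
    ORDER BY sec
    RANGE BETWEEN 172800 PRECEDING AND CURRENT ROW
  ) AS rolling3daysEvents
FROM dailyAggregations
-- ORDER BY url, event_id, day

The value 172800 is 2 x 24 x 3600 seconds. Because sec is aligned to midnight and both RANGE bounds are inclusive, the frame covers the current day plus the two days before it, which gives a rolling window of 3 calendar days (a bound of 259200, i.e. 3 x 24 x 3600, would span four). Change the value to set whatever rolling period you need.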
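
As a rough cross-check, here is a minimal, self-contained sketch that runs the same idea against the sample rows from the question. Two things in it are assumptions rather than part of the original answer: the inline yourTable CTE is a hypothetical stand-in for the real table, and 16:08:55 replaces the question's 16:68:55, which is not a valid time. It also orders the window by UNIX_DATE(day), so the frame is written in days: 2 PRECEDING AND CURRENT ROW means the current day and the two days before it, with both bounds inclusive.

```sql
-- Hypothetical inline sample standing in for yourTable.
-- Assumption: 16:08:55 replaces the question's invalid 16:68:55.
WITH yourTable AS (
  SELECT 'xx' AS event_id, 'a.html' AS url, TIMESTAMP '2016-10-18 15:55:16+00' AS ts UNION ALL
  SELECT 'xx', 'a.html', TIMESTAMP '2016-10-19 16:08:55+00' UNION ALL
  SELECT 'xx', 'a.html', TIMESTAMP '2016-10-25 20:55:57+00' UNION ALL
  SELECT 'yy', 'b.html', TIMESTAMP '2016-10-18 15:58:09+00' UNION ALL
  SELECT 'yy', 'a.html', TIMESTAMP '2016-10-18 08:32:43+00' UNION ALL
  SELECT 'zz', 'a.html', TIMESTAMP '2016-10-20 04:44:22+00' UNION ALL
  SELECT 'zz', 'c.html', TIMESTAMP '2016-10-21 02:12:34+00'
),
dailyAggregations AS (
  -- One row per (day, url, event_id) with that day's event count
  SELECT
    DATE(ts) AS day,
    url,
    event_id,
    COUNT(1) AS events
  FROM yourTable
  GROUP BY day, url, event_id
)
SELECT
  url, event_id, day, events,
  -- Trailing window: the current day plus the two days before it
  SUM(events) OVER (
    PARTITION BY url, event_id
    ORDER BY UNIX_DATE(day)
    RANGE BETWEEN 2 PRECEDING AND CURRENT ROW
  ) AS rolling3daysEvents
FROM dailyAggregations
ORDER BY url, event_id, day
```

On this sample, the row for url 'a.html', event 'xx', day 2016-10-19 should show rolling3daysEvents = 2, and every other row should show 1. Note that each row describes a trailing window ending on its own day, and a row only exists for days on which the event actually occurred, so a missing row stands for a day without that event rather than an explicit zero.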
