Best way to count records by arbitrary time intervals in Rails+Postgres

Question

My app has an Events table with time-stamped events.

I need to report the count of events during each of the most recent N time intervals. For different reports, the interval could be "each week" or "each day" or "each hour" or "each 15-minute interval".

For example, a user can display how many orders they received each week, day, hour, or quarter-hour.

1) My preference is to dynamically do a single SQL query (I'm using Postgres) that groups by an arbitrary time interval. Is there a way to do that?

2) An easy but ugly brute force way is to do a single query for all records within the start/end timeframe sorted by timestamp, then have a method manually build a tally by whatever interval.

3) Another approach would be to add separate fields to the event table for each interval and statically store the_week, the_day, the_hour, and the_quarter_hour, so I take the 'hit' once, when the record is created, instead of every time I report on that field.

What's best practice here, given I could modify the model and pre-store interval data if required (although at the modest expense of doubling the table width)?

Recommended answer

Luckily, you are using PostgreSQL. The set-returning function generate_series() is your friend.

Given the following test table (which you should have provided):

CREATE TABLE event(event_id serial, ts timestamp);
INSERT INTO event (ts)
SELECT generate_series(timestamp '2018-05-01'
                     , timestamp '2018-05-08'
                     , interval '7 min') + random() * interval '7 min';

This inserts one event for every 7 minutes, each shifted by a random 0 to 7 minutes.

This query counts events for any arbitrary time interval. 17 minutes in the example:

WITH grid AS (
   SELECT start_time
        , lead(start_time, 1, 'infinity') OVER (ORDER BY start_time) AS end_time
   FROM  (
      SELECT generate_series(min(ts), max(ts), interval '17 min') AS start_time
      FROM   event
      ) sub
   )
SELECT start_time, count(e.ts) AS events
FROM   grid       g
LEFT   JOIN event e ON e.ts >= g.start_time
                   AND e.ts <  g.end_time
GROUP  BY start_time
ORDER  BY start_time;

  • The query retrieves the minimum and maximum ts from the base table to cover the complete time range. You can use an arbitrary time range instead.

  • Provide any time interval as needed.

  • Produces one row for every time slot. If no event happened during that interval, the count is 0.

  • Be sure to handle the upper and lower bounds correctly: each slot includes its start_time and excludes its end_time.

  • The window function lead() has an often overlooked feature: it can provide a default value when no leading row exists. This example provides 'infinity'. Otherwise the last interval would be cut off with an upper bound of NULL.
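To see what the default argument of lead() changes, here is a minimal sketch on three literal timestamps. The column aliases end_default and end_infinity are illustrative and not part of the answer.

```sql
-- Compare lead() without and with a default value for the last row.
SELECT start_time
     , lead(start_time)                OVER (ORDER BY start_time) AS end_default   -- NULL on the last row
     , lead(start_time, 1, 'infinity') OVER (ORDER BY start_time) AS end_infinity  -- 'infinity' on the last row
FROM  (VALUES (timestamp '2018-05-01 00:00')
            , (timestamp '2018-05-01 00:17')
            , (timestamp '2018-05-01 00:34')) v(start_time);
```

With a NULL upper bound, the join condition e.ts < g.end_time evaluates to NULL for the last slot, so events in that slot would not be counted.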

The above query uses a CTE, lead(), and verbose syntax. It is elegant and maybe easier to understand, but a bit more expensive. Here is a shorter, faster, minimal version:

SELECT start_time, count(e.ts) AS events
FROM  (SELECT generate_series(min(ts), max(ts), interval '17 min') FROM event) g(start_time)
LEFT   JOIN event e ON e.ts >= g.start_time
                   AND e.ts <  g.start_time + interval '17 min'
GROUP  BY 1
ORDER  BY 1;

Example for "every 15 minutes in the past week", with output formatted by to_char():

SELECT to_char(start_time, 'YYYY-MM-DD HH24:MI'), count(e.ts) AS events
FROM   generate_series(date_trunc('day', localtimestamp - interval '7 days')
                     , localtimestamp
                     , interval '15 min') g(start_time)
LEFT   JOIN event e ON e.ts >= g.start_time
                   AND e.ts <  g.start_time + interval '15 min'
GROUP  BY start_time
ORDER  BY start_time;

Still use ORDER BY and GROUP BY on the underlying timestamp value, not on the formatted string. That's faster and more reliable.
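The same pattern can be adapted to the question's "most recent N intervals". The following is a sketch only, assuming N = 24 and hourly slots (both values are illustrative) and the test table event from above:

```sql
-- Counts for the 24 most recent hourly slots, including the current (partial) hour.
SELECT g.start_time, count(e.ts) AS events
FROM   generate_series(date_trunc('hour', localtimestamp) - interval '23 hours'
                     , date_trunc('hour', localtimestamp)
                     , interval '1 hour') g(start_time)
LEFT   JOIN event e ON e.ts >= g.start_time
                   AND e.ts <  g.start_time + interval '1 hour'
GROUP  BY g.start_time
ORDER  BY g.start_time;
```

Stepping back 23 hours from the truncated current hour yields exactly 24 slot start times. For other reports, change the truncation unit and the step together, for example 'day' with interval '1 day'.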
