Best way to count records by arbitrary time intervals in Rails+Postgres

Question

My app has an Events table with time-stamped events.

I need to report the count of events during each of the most recent N time intervals. For different reports, the interval could be "each week", "each day", "each hour", or "each 15-minute interval".

For example, a user can display how many orders they received each week, day, hour, or quarter-hour.

1) My preference is to dynamically do a single SQL query (I'm using Postgres) that groups by an arbitrary time interval. Is there a way to do that?

2) An easy but ugly brute-force way is to do a single query for all records within the start/end time frame, sorted by timestamp, then have a method manually build a tally by whatever interval.

3) Another approach would be to add separate fields to the event table for each interval and statically store the_week, the_day, the_hour, and the_quarter_hour fields, so I take the 'hit' once, at the time the record is created, instead of every time I report on that field.

What's the best practice here, given that I could modify the model and pre-store interval data if required (although at the modest expense of doubling the table width)?
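
For illustration only, option 3 amounts to truncating each timestamp to the start of its bucket when the record is saved. The sketch below is plain Ruby with hypothetical helper names (floor_time, bucket_columns) that are not part of the question. It assumes UTC timestamps and Monday-based weeks, which is also how Postgres date_trunc('week', ...) behaves.

```ruby
# Hypothetical helper: floors a Time to the start of its bucket, in UTC.
# `origin` shifts the grid. The Unix epoch (1970-01-01) fell on a Thursday,
# so an origin of 4 days aligns weekly buckets to Monday.
def floor_time(time, seconds, origin = 0)
  epoch = time.to_i
  Time.at(epoch - ((epoch - origin) % seconds)).utc
end

# Hypothetical helper: the four pre-stored values named in option 3.
def bucket_columns(ts)
  {
    the_week:         floor_time(ts, 7 * 86_400, 4 * 86_400),
    the_day:          floor_time(ts, 86_400),
    the_hour:         floor_time(ts, 3_600),
    the_quarter_hour: floor_time(ts, 900)
  }
end

cols = bucket_columns(Time.utc(2018, 5, 3, 14, 37, 12))
puts cols[:the_quarter_hour]
```

In a Rails model these values would probably be assigned in a callback before saving. That wiring is an assumption and is left out here so the sketch runs on its own.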

Recommended answer

Luckily, you are using PostgreSQL. The set-returning function generate_series() is your friend.

Given the following test table (which you should have provided):

CREATE TABLE event(event_id serial, ts timestamp);
INSERT INTO event (ts)
SELECT generate_series(timestamp '2018-05-01'
                     , timestamp '2018-05-08'
                     , interval '7 min') + random() * interval '7 min';

One event every 7 minutes (plus a random 0 to 7 minutes).

This query counts events for any arbitrary time interval, 17 minutes in the example:

WITH grid AS (
   SELECT start_time
        , lead(start_time, 1, 'infinity') OVER (ORDER BY start_time) AS end_time
   FROM  (
      SELECT generate_series(min(ts), max(ts), interval '17 min') AS start_time
      FROM   event
      ) sub
   )
SELECT start_time, count(e.ts) AS events
FROM   grid       g
LEFT   JOIN event e ON e.ts >= g.start_time
                   AND e.ts <  g.end_time
GROUP  BY start_time
ORDER  BY start_time;

  • The query retrieves the minimum and maximum ts from the base table to cover the complete time range. You can use an arbitrary time range instead.

  • Provide any time interval as needed.

  • Produces one row for every time slot. If no event happened during that interval, the count is 0.

  • Be sure to handle the upper and lower bounds correctly: the join includes each slot's start_time and excludes its end_time.

  • The window function lead() has an often-overlooked feature: it can provide a default value when no leading row exists. The query provides 'infinity' here. Otherwise the last interval would be cut off with an upper bound of NULL.
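
To make the slot semantics concrete, here is a minimal Ruby sketch of the same counting rules: one entry per slot, zero for empty slots, and half-open bounds. It is an illustration, not the answer's method. The function name count_by_slot and the sample data are hypothetical, and each slot ends at start + step rather than using lead().

```ruby
# Counts timestamps per slot of `step` seconds between `from` and `to`.
# Slots are half-open, [start, start + step), mirroring
# e.ts >= start_time AND e.ts < end_time in the SQL.
def count_by_slot(timestamps, from, to, step)
  counts = {}
  start = from
  while start <= to            # same slot starts as generate_series(from, to, step)
    counts[start] = 0          # every slot appears, even when it stays empty
    start += step
  end

  timestamps.each do |ts|
    next if ts < from
    slot = from + ((ts - from) / step).floor * step
    counts[slot] += 1 if counts.key?(slot)
  end
  counts
end

from   = Time.utc(2018, 5, 1)
events = [from, from + 1019, from + 1020, from + 3100]
count_by_slot(events, from, from + 3600, 17 * 60).each do |start, n|
  puts "#{start} #{n}"
end
```

The event at from + 1020 lands in the second slot, not the first, because the upper bound of each slot is exclusive.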

The above query uses a CTE, lead(), and verbose syntax. Elegant and maybe easier to understand, but a bit more expensive. Here is a shorter, faster, minimal version:

      SELECT start_time, count(e.ts) AS events
      FROM  (SELECT generate_series(min(ts), max(ts), interval '17 min') FROM event) g(start_time)
      LEFT   JOIN event e ON e.ts >= g.start_time
                         AND e.ts <  g.start_time + interval '17 min'
      GROUP  BY 1
      ORDER  BY 1;
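
The interval appears twice in the minimal query, so an application that lets users pick the bucket width may want to build the statement from a single validated value. The following Ruby sketch shows one way to do that. The helper name slot_count_sql is hypothetical, and restricting the input to whole minutes is an assumption made to keep the interpolation safe.

```ruby
# Hypothetical helper: builds the minimal query for a bucket width in whole
# minutes. Integer() raises on non-numeric strings such as "15; DROP ...",
# so only a number ever reaches the SQL text.
def slot_count_sql(minutes)
  minutes = Integer(minutes)
  raise ArgumentError, "interval must be positive" unless minutes.positive?

  <<~SQL
    SELECT start_time, count(e.ts) AS events
    FROM  (SELECT generate_series(min(ts), max(ts), interval '#{minutes} min') FROM event) g(start_time)
    LEFT   JOIN event e ON e.ts >= g.start_time
                       AND e.ts <  g.start_time + interval '#{minutes} min'
    GROUP  BY 1
    ORDER  BY 1;
  SQL
end

puts slot_count_sql(15)
```

In a Rails app the resulting string could presumably be handed to ActiveRecord::Base.connection.select_all. That call is left out here so the sketch runs without a database.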
      

Example for "every 15 minutes in the past week", formatted with to_char():

      SELECT to_char(start_time, 'YYYY-MM-DD HH24:MI'), count(e.ts) AS events
      FROM   generate_series(date_trunc('day', localtimestamp - interval '7 days')
                           , localtimestamp
                           , interval '15 min') g(start_time)
      LEFT   JOIN event e ON e.ts >= g.start_time
                         AND e.ts <  g.start_time + interval '15 min'
      GROUP  BY start_time
      ORDER  BY start_time;

Still ORDER BY and GROUP BY on the underlying timestamp value, not on the formatted string. That's faster and more reliable.
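
The reliability point is easy to demonstrate outside the database. The 'YYYY-MM-DD HH24:MI' pattern used above happens to sort chronologically as text, but many display formats do not. This Ruby sketch uses a hypothetical day-first format and a hypothetical helper, sort_both_ways, to compare the two orderings.

```ruby
# Returns [labels sorted by the underlying time, labels sorted as strings].
def sort_both_ways(times, fmt)
  by_value  = times.sort.map { |t| t.strftime(fmt) }
  by_string = times.map { |t| t.strftime(fmt) }.sort
  [by_value, by_string]
end

times = [Time.utc(2018, 5, 2, 9), Time.utc(2018, 4, 30, 9), Time.utc(2018, 5, 1, 9)]
by_value, by_string = sort_both_ways(times, '%d.%m.%Y %H:%M')

puts by_value.inspect   # chronological: 30.04. comes first
puts by_string.inspect  # textual: 01.05. comes first, 30.04. last
```

Sorting on the timestamp keeps the order correct whatever format is chosen for display.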

db<>fiddle here

Related answer producing a running count over the time frame:
