每周汇总最近加入的记录 [英] Aggregating the most recent joined records per week

查看:66
本文介绍了每周汇总最近加入的记录的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在Postgres中有一个 updates 表是9.4.5,就像这样:

I have an updates table in Postgres is 9.4.5 like this:

goal_id    | created_at | status
1          | 2016-01-01 | green
1          | 2016-01-02 | red
2          | 2016-01-02 | amber

目标表如下:

id | company_id
1  | 1
2  | 2

我想为每个公司创建一个图表,以显示每个目标的所有状态

I want to create a chart for each company that shows the state of all of their goals, per week.

我想这会产生一系列在过去8周内,找到该周之前每个目标的最新更新,然后计算找到的更新的不同状态。

I image this would require to generate a series of the past 8 weeks, finding the most recent update for each goal that came before that week, then counting the different statuses of the found updates.

到目前为止,我有什么:

What I have so far:

SELECT EXTRACT(year from generate_series) AS year, 
       EXTRACT(week from generate_series) AS week,
       u.company_id,
       COUNT(*) FILTER (WHERE u.status = 'green') AS green_count,
       COUNT(*) FILTER (WHERE u.status = 'amber') AS amber_count,
       COUNT(*) FILTER (WHERE u.status = 'red') AS red_count
FROM generate_series(NOW() - INTERVAL '2 MONTHS', NOW(), '1 week')
LEFT OUTER JOIN (
  SELECT DISTINCT ON(year, week)
         goals.company_id,
         updates.status, 
         EXTRACT(week from updates.created_at) week,
         EXTRACT(year from updates.created_at) AS year,
         updates.created_at 
  FROM updates
  JOIN goals ON goals.id = updates.goal_id
  ORDER BY year, week, updates.created_at DESC
) u ON u.week = week AND u.year = year
GROUP BY 1,2,3

但这有两个问题。似乎在 u 上的联接无法正常工作。它似乎在内部查询返回的每一行(?)上联接,并且这仅选择该周以来发生的最新更新。如果需要,它应该获取该周之前的最新更新。

But this has two problems. It seems that the join on u isn't working as I thought it would. It seems to be joining on every row (?) returned from the inner query as well as this only selects the most recent update that happened from that week. It should grab the most recent update from before that week if it needs to.

这是一些非常复杂的SQL,我喜欢其中的一些输入

This is some pretty complicated SQL and I love some input on how to pull it off.

目标表具有约1000个目标ATM和每周大约增长100:

The goals table has around ~1000 goals ATM and is growing about ~100 a week:

                                           Table "goals"
     Column      |            Type             |                         Modifiers
-----------------+-----------------------------+-----------------------------------------------------------
 id              | integer                     | not null default nextval('goals_id_seq'::regclass)
 company_id      | integer                     | not null
 name            | text                        | not null
 created_at      | timestamp without time zone | not null default timezone('utc'::text, now())
 updated_at      | timestamp without time zone | not null default timezone('utc'::text, now())
Indexes:
    "goals_pkey" PRIMARY KEY, btree (id)
    "entity_goals_company_id_fkey" btree (company_id)
Foreign-key constraints:
    "goals_company_id_fkey" FOREIGN KEY (company_id) REFERENCES companies(id) ON DELETE RESTRICT

更新表大约有1000个,并且每周增长大约100个:

The updates table has around ~1000 and is growing around ~100 a week:

                                         Table "updates"
   Column   |            Type             |                            Modifiers
------------+-----------------------------+------------------------------------------------------------------
 id         | integer                     | not null default nextval('updates_id_seq'::regclass)
 status     | entity.goalstatus           | not null
 goal_id    | integer                     | not null
 created_at | timestamp without time zone | not null default timezone('utc'::text, now())
 updated_at | timestamp without time zone | not null default timezone('utc'::text, now())
Indexes:
    "goal_updates_pkey" PRIMARY KEY, btree (id)
    "entity_goal_updates_goal_id_fkey" btree (goal_id)
Foreign-key constraints:
    "updates_goal_id_fkey" FOREIGN KEY (goal_id) REFERENCES goals(id) ON DELETE CASCADE

 Schema |       Name        | Internal name | Size | Elements | Access privileges | Description
--------+-------------------+---------------+------+----------+-------------------+-------------
 entity | entity.goalstatus | goalstatus    | 4    | green   +|                   |
        |                   |               |      | amber   +|                   |
        |                   |               |      | red      |                   |


推荐答案

您每周和目标需要一个数据项(之前每个公司的汇总计数)。这是 generate_series()目标之间的交叉加入 。 (可能)昂贵的部分是从每个更新获取当前的状态。就像 @Paul已经建议一样, LATERAL 联接似乎是最好的工具。不过,仅对更新这样做,并使用 LIMIT 1 的更快技术。

You need one data item per week and goal (before aggregating counts per company). That's a plain CROSS JOIN between generate_series() and goals. The (possibly) expensive part is to get the current state from updates for each. Like @Paul already suggested, a LATERAL join seems like the best tool. Do it only for updates, though, and use a faster technique with LIMIT 1.

,并使用 date_trunc()

SELECT w_start
     , g.company_id
     , count(*) FILTER (WHERE u.status = 'green') AS green_count
     , count(*) FILTER (WHERE u.status = 'amber') AS amber_count
     , count(*) FILTER (WHERE u.status = 'red')   AS red_count
FROM   generate_series(date_trunc('week', NOW() - interval '2 months')
                     , date_trunc('week', NOW())
                     , interval '1 week') w_start
CROSS  JOIN goals g
LEFT   JOIN LATERAL (
   SELECT status
   FROM   updates
   WHERE  goal_id = g.id
   AND    created_at < w_start
   ORDER  BY created_at DESC
   LIMIT  1
   ) u ON true
GROUP  BY w_start, g.company_id
ORDER  BY w_start, g.company_id;

要使此 快速 ,您需要多列索引

CREATE INDEX updates_special_idx ON updates (goal_id, created_at DESC, status);

created_at 的降序是最好的,但是并非绝对必要。 Postgres可以几乎完全一样快地向后扫描索引。 (不适用于

Descending order for created_at is best, but not strictly necessary. Postgres can scan indexes backwards almost exactly as fast. (Not applicable for inverted sort order of multiple columns, though.)

那个顺序索引列。为什么?

Index columns in that order. Why?

  • Multicolumn index and performance

第三栏 status 仅附加到允许在更新仅索引扫描 c>。相关案例:

And the third column status is only appended to allow fast index-only scans on updates. Related case:

  • Slow index scans in large table

1k目标持续9周(您的2个月间隔至少重叠9周)仅需要仅1k行的第二张表的9k索引查找。对于像这样的小表,性能应该不是太大的问题。但是一旦每张表又增加了几千个,顺序扫描就会降低性能。

1k goals for 9 weeks (your interval of 2 months overlaps with at least 9 weeks) only require 9k index look-ups for the 2nd table of only 1k rows. For small tables like this, performance shouldn't be much of a problem. But once you have a couple of thousand more in each table, performance will deteriorate with sequential scans.

w_start 表示每个星期的开始。因此,计数是一周的开始。如果您坚持认为,您仍然可以提取年和周(或其他任何代表您一周的信息)

w_start represents the start of each week. Consequently, counts are for the start of the week. You can still extract year and week (or any other details represent your week), if you insist:

   EXTRACT(isoyear from w_start) AS year
 , EXTRACT(week    from w_start) AS week

最好与 ISOYEAR ,就像@Paul解释的那样。

Best with ISOYEAR, like @Paul explained.

SQL提琴。

相关:

  • What is the difference between LATERAL and a subquery in PostgreSQL?
  • Optimize GROUP BY query to retrieve latest record per user
  • Select first row in each GROUP BY group?
  • PostgreSQL: running count of rows for a query 'by minute'

这篇关于每周汇总最近加入的记录的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆