每周汇总最近加入的记录 [英] Aggregating the most recent joined records per week
问题描述
我在Postgres中有一个 updates
表是9.4.5,就像这样:
I have an updates
table in Postgres is 9.4.5 like this:
goal_id | created_at | status
1 | 2016-01-01 | green
1 | 2016-01-02 | red
2 | 2016-01-02 | amber
和目标
表如下:
id | company_id
1 | 1
2 | 2
我想为每个公司创建一个图表,以显示每个目标的所有状态
I want to create a chart for each company that shows the state of all of their goals, per week.
我想这会产生一系列在过去8周内,找到该周之前每个目标的最新更新,然后计算找到的更新的不同状态。
I image this would require to generate a series of the past 8 weeks, finding the most recent update for each goal that came before that week, then counting the different statuses of the found updates.
到目前为止,我有什么:
What I have so far:
SELECT EXTRACT(year from generate_series) AS year,
EXTRACT(week from generate_series) AS week,
u.company_id,
COUNT(*) FILTER (WHERE u.status = 'green') AS green_count,
COUNT(*) FILTER (WHERE u.status = 'amber') AS amber_count,
COUNT(*) FILTER (WHERE u.status = 'red') AS red_count
FROM generate_series(NOW() - INTERVAL '2 MONTHS', NOW(), '1 week')
LEFT OUTER JOIN (
SELECT DISTINCT ON(year, week)
goals.company_id,
updates.status,
EXTRACT(week from updates.created_at) week,
EXTRACT(year from updates.created_at) AS year,
updates.created_at
FROM updates
JOIN goals ON goals.id = updates.goal_id
ORDER BY year, week, updates.created_at DESC
) u ON u.week = week AND u.year = year
GROUP BY 1,2,3
但这有两个问题。似乎在 u
上的联接无法正常工作。它似乎在内部查询返回的每一行(?)上联接,并且这仅选择该周以来发生的最新更新。如果需要,它应该获取该周之前的最新更新。
But this has two problems. It seems that the join on u
isn't working as I thought it would. It seems to be joining on every row (?) returned from the inner query as well as this only selects the most recent update that happened from that week. It should grab the most recent update from before that week if it needs to.
这是一些非常复杂的SQL,我喜欢其中的一些输入
This is some pretty complicated SQL and I love some input on how to pull it off.
目标表具有约1000个目标ATM和每周大约增长100:
The goals table has around ~1000 goals ATM and is growing about ~100 a week:
Table "goals"
Column | Type | Modifiers
-----------------+-----------------------------+-----------------------------------------------------------
id | integer | not null default nextval('goals_id_seq'::regclass)
company_id | integer | not null
name | text | not null
created_at | timestamp without time zone | not null default timezone('utc'::text, now())
updated_at | timestamp without time zone | not null default timezone('utc'::text, now())
Indexes:
"goals_pkey" PRIMARY KEY, btree (id)
"entity_goals_company_id_fkey" btree (company_id)
Foreign-key constraints:
"goals_company_id_fkey" FOREIGN KEY (company_id) REFERENCES companies(id) ON DELETE RESTRICT
更新
表大约有1000个,并且每周增长大约100个:
The updates
table has around ~1000 and is growing around ~100 a week:
Table "updates"
Column | Type | Modifiers
------------+-----------------------------+------------------------------------------------------------------
id | integer | not null default nextval('updates_id_seq'::regclass)
status | entity.goalstatus | not null
goal_id | integer | not null
created_at | timestamp without time zone | not null default timezone('utc'::text, now())
updated_at | timestamp without time zone | not null default timezone('utc'::text, now())
Indexes:
"goal_updates_pkey" PRIMARY KEY, btree (id)
"entity_goal_updates_goal_id_fkey" btree (goal_id)
Foreign-key constraints:
"updates_goal_id_fkey" FOREIGN KEY (goal_id) REFERENCES goals(id) ON DELETE CASCADE
Schema | Name | Internal name | Size | Elements | Access privileges | Description
--------+-------------------+---------------+------+----------+-------------------+-------------
entity | entity.goalstatus | goalstatus | 4 | green +| |
| | | | amber +| |
| | | | red | |
推荐答案
您每周和目标需要一个数据项(之前每个公司的汇总计数)。这是 generate_series()
与目标
之间的交叉加入
。 (可能)昂贵的部分是从每个更新
获取当前的状态
。就像 @Paul已经建议一样, LATERAL
联接似乎是最好的工具。不过,仅对更新
这样做,并使用 LIMIT 1
的更快技术。
You need one data item per week and goal (before aggregating counts per company). That's a plain CROSS JOIN
between generate_series()
and goals
. The (possibly) expensive part is to get the current state
from updates
for each. Like @Paul already suggested, a LATERAL
join seems like the best tool. Do it only for updates
, though, and use a faster technique with LIMIT 1
.
,并使用 date_trunc()
。
SELECT w_start
, g.company_id
, count(*) FILTER (WHERE u.status = 'green') AS green_count
, count(*) FILTER (WHERE u.status = 'amber') AS amber_count
, count(*) FILTER (WHERE u.status = 'red') AS red_count
FROM generate_series(date_trunc('week', NOW() - interval '2 months')
, date_trunc('week', NOW())
, interval '1 week') w_start
CROSS JOIN goals g
LEFT JOIN LATERAL (
SELECT status
FROM updates
WHERE goal_id = g.id
AND created_at < w_start
ORDER BY created_at DESC
LIMIT 1
) u ON true
GROUP BY w_start, g.company_id
ORDER BY w_start, g.company_id;
要使此 快速 ,您需要多列索引:
CREATE INDEX updates_special_idx ON updates (goal_id, created_at DESC, status);
created_at
的降序是最好的,但是并非绝对必要。 Postgres可以几乎完全一样快地向后扫描索引。 (不适用于)
Descending order for created_at
is best, but not strictly necessary. Postgres can scan indexes backwards almost exactly as fast. (Not applicable for inverted sort order of multiple columns, though.)
按那个顺序索引列。为什么?
Index columns in that order. Why?
- Multicolumn index and performance
第三栏 status
仅附加到允许在更新$ c $上快速仅索引扫描 c>。相关案例:
And the third column status
is only appended to allow fast index-only scans on updates
. Related case:
- Slow index scans in large table
1k目标持续9周(您的2个月间隔至少重叠9周)仅需要仅1k行的第二张表的9k索引查找。对于像这样的小表,性能应该不是太大的问题。但是一旦每张表又增加了几千个,顺序扫描就会降低性能。
1k goals for 9 weeks (your interval of 2 months overlaps with at least 9 weeks) only require 9k index look-ups for the 2nd table of only 1k rows. For small tables like this, performance shouldn't be much of a problem. But once you have a couple of thousand more in each table, performance will deteriorate with sequential scans.
w_start
表示每个星期的开始。因此,计数是一周的开始。如果您坚持认为,您仍然可以提取年和周(或其他任何代表您一周的信息)
w_start
represents the start of each week. Consequently, counts are for the start of the week. You can still extract year and week (or any other details represent your week), if you insist:
EXTRACT(isoyear from w_start) AS year
, EXTRACT(week from w_start) AS week
最好与 ISOYEAR
,就像@Paul解释的那样。
Best with ISOYEAR
, like @Paul explained.
相关:
- What is the difference between LATERAL and a subquery in PostgreSQL?
- Optimize GROUP BY query to retrieve latest record per user
- Select first row in each GROUP BY group?
- PostgreSQL: running count of rows for a query 'by minute'
这篇关于每周汇总最近加入的记录的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!