Time series querying in Postgres
Question
This is a follow-on question from @Erwin's answer to Efficient time series querying in Postgres.
To keep things simple, I'll use the same table structure as that question:
id | widget_id | for_date | score |
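For anyone who wants to run the examples, a minimal table definition consistent with that structure might look like the sketch below. The column types are assumptions; the question only lists the column names.

```sql
-- Assumed column types; only the column names come from the question.
CREATE TABLE score (
    id        integer PRIMARY KEY,
    widget_id integer NOT NULL,
    for_date  date    NOT NULL,
    score     integer
);
```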
The original question was to get the score for each widget for every date in a range. If there was no entry for a widget on a given date, the query should show the score from that widget's previous entry. The solution using a cross join and a window function worked well as long as all the data was contained in the range being queried. My problem is that I want the previous score even if it lies outside the date range we are looking at.
Sample data:
INSERT INTO score (id, widget_id, for_date, score) values
(1, 1337, '2012-04-07', 52),
(2, 2222, '2012-05-05', 99),
(3, 1337, '2012-05-07', 112),
(4, 2222, '2012-05-07', 101);
When I query for the range May 5th to May 10th 2012 (i.e. generate_series('2012-05-05'::date, '2012-05-10'::date, '1d')), I would like to get the following:
DAY WIDGET_ID SCORE
May, 05 2012 1337 52
May, 05 2012 2222 99
May, 06 2012 1337 52
May, 06 2012 2222 99
May, 07 2012 1337 112
May, 07 2012 2222 101
May, 08 2012 1337 112
May, 08 2012 2222 101
May, 09 2012 1337 112
May, 09 2012 2222 101
May, 10 2012 1337 112
May, 10 2012 2222 101
The best solution so far (also by @Erwin) is:
SELECT a.day, a.widget_id, s.score
FROM (
SELECT d.day, w.widget_id
,max(s.for_date) OVER (PARTITION BY w.widget_id ORDER BY d.day) AS effective_date
FROM (SELECT generate_series('2012-05-05'::date, '2012-05-10'::date, '1d')::date AS day) d
CROSS JOIN (SELECT DISTINCT widget_id FROM score) AS w
LEFT JOIN score s ON s.for_date = d.day AND s.widget_id = w.widget_id
) a
LEFT JOIN score s ON s.for_date = a.effective_date AND s.widget_id = a.widget_id
ORDER BY a.day, a.widget_id;
But as you can see in this SQL Fiddle, it produces null scores for widget 1337 on the first two days. I would like to see the earlier score of 52 from row 1 in their place.
Is it possible to do this in an efficient way?
Recommended answer
As @Roman mentioned, DISTINCT ON can solve this. Details in this related answer:
- Select first row in each GROUP BY group?
Subqueries are generally a bit faster than CTEs, though:
SELECT DISTINCT ON (d.day, w.widget_id)
d.day, w.widget_id, s.score
FROM generate_series('2012-05-05'::date, '2012-05-10'::date, '1d') d(day)
CROSS JOIN (SELECT DISTINCT widget_id FROM score) AS w
LEFT JOIN score s ON s.widget_id = w.widget_id AND s.for_date <= d.day
ORDER BY d.day, w.widget_id, s.for_date DESC;
You can use a set-returning function like a table in the FROM list.
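As a small illustration of a set-returning function used like a table, the sketch below selects directly from generate_series(). The aliases d and day match the query above. The cast to date is an addition here: with date arguments, generate_series() returns timestamp with time zone, so the cast is only needed if plain dates are wanted in the output.

```sql
-- The function call takes the place of a table:
-- d is the table alias, day is the column alias.
SELECT d.day::date AS day   -- cast added for illustration; without it the column is timestamptz
FROM   generate_series('2012-05-05'::date, '2012-05-10'::date, '1d') AS d(day);
```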
A multicolumn index should be the key to performance:
CREATE INDEX score_multi_idx ON score (widget_id, for_date, score);
The third column, score, is only included to make it a covering index in Postgres 9.2 or later. You would not include it in earlier versions.
Of course, if you have many widgets and a wide range of days, the CROSS JOIN produces a lot of rows, which comes at a price. Only select the widgets and days you actually need.
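One way to limit the widgets, sketched here under the assumption that the wanted IDs are known up front, is to replace the DISTINCT subquery with an explicit list. The VALUES list below is hypothetical and simply reuses the two widget IDs from the sample data; the rest of the query is unchanged.

```sql
SELECT DISTINCT ON (d.day, w.widget_id)
       d.day, w.widget_id, s.score
FROM   generate_series('2012-05-05'::date, '2012-05-10'::date, '1d') d(day)
CROSS  JOIN (VALUES (1337), (2222)) AS w(widget_id)  -- hypothetical list of wanted widgets
LEFT   JOIN score s ON s.widget_id = w.widget_id AND s.for_date <= d.day
ORDER  BY d.day, w.widget_id, s.for_date DESC;       -- latest entry on or before each day wins
```

This also avoids the extra pass over score that SELECT DISTINCT widget_id needs, at the cost of maintaining the list by hand.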