How do I take a DISTINCT ON subquery that is ordered by a separate column, and make it fast?
Question
(AKA: with a query and data very similar to the question "Selecting rows ordered by some column and distinct on another", how can I get it to run fast?) Postgres 11.
I have a table prediction with columns (article_id, prediction_date, predicted_as, article_published_date) that represents the output from a classifier over a set of articles.
New articles are frequently added to a separate table (referenced by the FK article_id), and new predictions are added as we tune our classifier.
Sample data:
| id | article_id | predicted_as | prediction_date | article_published_date
| 1009381 | 362718 | negative | 2018-07-27 | 2018-06-26
| 1009382 | 362718 | positive | 2018-08-12 | 2018-06-26
| 1009383 | 362719 | positive | 2018-08-13 | 2010-09-22
| 1009384 | 362719 | positive | 2018-09-28 | 2010-09-22
| 1009385 | 362719 | negative | 2018-10-01 | 2010-09-22
Create table script:
create table prediction
(
    id                     serial       not null
        constraint prediction_pkey
            primary key,
    article_id             integer      not null
        constraint prediction_article_id_fkey
            references article,
    predicted_as           classifiedas not null,
    prediction_date        date         not null,
    article_published_date date         not null
);

create index prediction_article_id_prediction_date_idx
    on prediction (article_id asc, prediction_date desc);
We frequently want to view the most recent classification for each article. To do so, we use:
SELECT DISTINCT ON (article_id) article_id, id, article_published_date
FROM prediction
ORDER BY article_id, prediction_date desc
This returns something like:
| id | article_id | predicted_as | prediction_date | article_published_date
| 120950 | 1 | negative | 2018-06-29 | 2018-03-25
| 120951 | 2 | negative | 2018-06-29 | 2018-03-19
With an index on (article_id, prediction_date desc), this query runs very quickly (~15ms). This is the explain plan:
Unique  (cost=0.56..775374.53 rows=1058394 width=20)
  ->  Index Scan using prediction_article_id_prediction_date_id_idx on prediction  (cost=0.56..756071.98 rows=7721023 width=20)
The problem occurs when I want to sort this result by article_published_date. E.g.:
explain (analyze, buffers)
select *
from (
    select distinct on (article_id) article_id, id, article_published_date
    from prediction
    order by article_id, prediction_date desc
) most_recent_predictions
order by article_published_date desc
limit 3;
This works, but the query takes ~3-4 seconds to run, making it too slow to use directly to respond to a web request.
Here is the explain plan:
Limit  (cost=558262.52..558262.53 rows=3 width=12) (actual time=4748.977..4748.979 rows=3 loops=1)
  Buffers: shared hit=7621849 read=9051
  ->  Sort  (cost=558262.52..560851.50 rows=1035593 width=12) (actual time=4748.975..4748.976 rows=3 loops=1)
        Sort Key: most_recent_predictions.article_published_date DESC
        Sort Method: top-N heapsort  Memory: 25kB
        Buffers: shared hit=7621849 read=9051
        ->  Subquery Scan on most_recent_predictions  (cost=0.43..544877.67 rows=1035593 width=12) (actual time=0.092..4508.464 rows=1670807 loops=1)
              Buffers: shared hit=7621849 read=9051
              ->  Result  (cost=0.43..534521.74 rows=1035593 width=16) (actual time=0.092..4312.916 rows=1670807 loops=1)
                    Buffers: shared hit=7621849 read=9051
                    ->  Unique  (cost=0.43..534521.74 rows=1035593 width=16) (actual time=0.090..4056.644 rows=1670807 loops=1)
                          Buffers: shared hit=7621849 read=9051
                          ->  Index Scan using prediction_article_id_prediction_date_idx on prediction  (cost=0.43..515295.09 rows=7690662 width=16) (actual time=0.089..3248.250 rows=7690662 loops=1)
                                Buffers: shared hit=7621849 read=9051
Planning Time: 0.130 ms
Execution Time: 4749.007 ms
Is there any way to make this query run more quickly, or will I have to resort to refreshing a materialized view or setting up a trigger system to get this data quickly?
For reference:
- the prediction table has 7.7M rows
- there are 1.7M distinct article_ids in the prediction table
- there is an index on (article_id, prediction_date desc) as well as one on article_published_date desc
- VACUUM ANALYSE has been run
Recommended answer
I wonder if you can make this work:
select article_id, id, article_published_date
from prediction p
where p.prediction_date = (select max(p2.prediction_date)
                           from prediction p2
                           where p2.article_id = p.article_id
                          )
order by article_published_date desc;
Then use these two indexes:
- (article_published_date desc, prediction_date, article_id, id)
- (article_id, prediction_date desc)
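As a rough sketch, the two suggested indexes and the rewritten query could be set up as follows. The index name prediction_published_date_covering_idx is hypothetical, and the limit 3 is carried over from the question rather than from the answer's query. Whether the planner actually walks the new index in article_published_date order and stops early is not guaranteed, so the explain output should be checked against the real data.

```sql
-- Hypothetical index name; the column list is the one suggested in the answer.
create index prediction_published_date_covering_idx
    on prediction (article_published_date desc, prediction_date, article_id, id);

-- The second suggested index, (article_id, prediction_date desc), already
-- exists in the question as prediction_article_id_prediction_date_idx,
-- so it is not created again here.

-- Same query as in the answer, with the question's limit added so that
-- the scan has a chance to stop after the first few matching rows.
explain (analyze, buffers)
select p.article_id, p.id, p.article_published_date
from prediction p
where p.prediction_date = (select max(p2.prediction_date)
                           from prediction p2
                           where p2.article_id = p.article_id)
order by p.article_published_date desc
limit 3;
```

One difference from the DISTINCT ON version: if an article has two predictions with the same latest prediction_date, this query returns both rows, whereas DISTINCT ON returns exactly one row per article_id.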