在postgres上缓慢选择不同的查询 [英] Slow select distinct query on postgres
问题描述
我经常在本质上收集日志信息的表上进行以下两个查询。两者都从大量的行中选择了不同的值,但是这些行中的值少于10个。
I'm doing the following two queries quite frequently on a table that essentially gathers up logging information. Both select distinct values from a huge number of rows but with less than 10 different values in those.
我已经分析了页面完成的两个不同查询: / p>
I've analyzed both "distinct" queries done by the page:
marchena=> explain select distinct auditrecor0_.bundle_id as col_0_0_ from audit_records auditrecor0_;
QUERY PLAN
----------------------------------------------------------------------------------------------
HashAggregate (cost=1070734.05..1070734.11 rows=6 width=21)
-> Seq Scan on audit_records auditrecor0_ (cost=0.00..1023050.24 rows=19073524 width=21)
(2 rows)
marchena=> explain select distinct auditrecor0_.server_name as col_0_0_ from audit_records auditrecor0_;
QUERY PLAN
----------------------------------------------------------------------------------------------
HashAggregate (cost=1070735.34..1070735.39 rows=5 width=13)
-> Seq Scan on audit_records auditrecor0_ (cost=0.00..1023051.47 rows=19073547 width=13)
(2 rows)
都对列进行顺序扫描。但是,如果我关闭了enable_seqscan(尽管名称只禁用了对具有索引的列的序列扫描),则查询将使用索引,但速度会更慢:
Both do sequence scans of the columns. However if I turn off enable_seqscan (dispite the name this only disables doing sequence scans on columns with indices) the query uses the index, but is even slower:
marchena=> set enable_seqscan = off;
SET
marchena=> explain select distinct auditrecor0_.bundle_id as col_0_0_ from audit_records auditrecor0_;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Unique (cost=0.00..19613740.62 rows=6 width=21)
-> Index Scan using audit_bundle_idx on audit_records auditrecor0_ (cost=0.00..19566056.69 rows=19073570 width=21)
(2 rows)
marchena=> explain select distinct auditrecor0_.server_name as col_0_0_ from audit_records auditrecor0_;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Unique (cost=0.00..45851449.96 rows=5 width=13)
-> Index Scan using audit_server_idx on audit_records auditrecor0_ (cost=0.00..45803766.04 rows=19073570 width=13)
(2 rows)
bundle_id和server_name列上都有btree索引,我应该使用其他类型的索引来快速选择不同的值吗?
Both bundle_id and server_name columns have btree indices on them, should I be using a different type of index to make selecting distinct values fast?
推荐答案
BEGIN;
CREATE TABLE dist ( x INTEGER NOT NULL );
INSERT INTO dist SELECT random()*50 FROM generate_series( 1, 5000000 );
COMMIT;
CREATE INDEX dist_x ON dist(x);
VACUUM ANALYZE dist;
EXPLAIN ANALYZE SELECT DISTINCT x FROM dist;
HashAggregate (cost=84624.00..84624.51 rows=51 width=4) (actual time=1840.141..1840.153 rows=51 loops=1)
-> Seq Scan on dist (cost=0.00..72124.00 rows=5000000 width=4) (actual time=0.003..573.819 rows=5000000 loops=1)
Total runtime: 1848.060 ms
PG不能(但)使用索引来区分(跳过相同的值),但是您可以这样做:
PG can't (yet) use an index for distinct (skipping the identical values) but you can do this :
CREATE OR REPLACE FUNCTION distinct_skip_foo()
RETURNS SETOF INTEGER
LANGUAGE plpgsql STABLE
AS $$
DECLARE
_x INTEGER;
BEGIN
_x := min(x) FROM dist;
WHILE _x IS NOT NULL LOOP
RETURN NEXT _x;
_x := min(x) FROM dist WHERE x > _x;
END LOOP;
END;
$$ ;
EXPLAIN ANALYZE SELECT * FROM distinct_skip_foo();
Function Scan on distinct_skip_foo (cost=0.00..260.00 rows=1000 width=4) (actual time=1.629..1.635 rows=51 loops=1)
Total runtime: 1.652 ms
这篇关于在postgres上缓慢选择不同的查询的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!