Postgres中多个字段的索引扫描松散? [英] Loose index scan in Postgres on more than one field?

查看:114
本文介绍了Postgres中多个字段的索引扫描松散?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在Postgres 9.2中有几个大表(数百万行),在这些表中,我需要基于源(varchar)和 id(int)这两个字段的组合生成唯一的代码。我可以通过在以下结果上生成row_numbers来做到这一点:

I have several large tables in Postgres 9.2 (millions of rows) where I need to generate a unique code based on the combination of two fields, 'source' (varchar) and 'id' (int). I can do this by generating row_numbers over the result of:

SELECT source,id FROM tablename GROUP BY source,id

,但结果可能需要一段时间才能处理。建议,如果对字段进行索引,并且索引值的数量成比例地减少(这是我的情况),则松散的索引扫描可能是更好的选择: http://wiki.postgresql.org/wiki/Loose_indexscan

but the results can take a while to process. It has been recommended that if the fields are indexed, and there are a proportionally small number of index values (which is my case), that a loose index scan may be a better option: http://wiki.postgresql.org/wiki/Loose_indexscan

WITH RECURSIVE
     t AS (SELECT min(col) AS col FROM tablename
           UNION ALL
           SELECT (SELECT min(col) FROM tablename WHERE col > t.col) FROM t WHERE t.col IS NOT NULL)
SELECT col FROM t WHERE col IS NOT NULL
UNION ALL
SELECT NULL WHERE EXISTS(SELECT * FROM tablename WHERE col IS NULL);

该示例在单个字段上运行。尝试返回多个字段会产生错误:子查询必须仅返回一列。一种可能是尝试检索整个ROW-例如 SELECT ROW(min(source),min(id)... ,但是然后我不确定WHERE语句的语法看起来像什么工作

The example operates on a single field though. Trying to return more than one field generates an error: subquery must return only one column. One possibility might be to try retrieving an entire ROW - e.g. SELECT ROW(min(source),min(id)..., but then I'm not sure what the syntax of the WHERE statement would need to look like to work with individual row elements.

问题是:是否可以修改基于递归的代码以处理多个列,如果可以,那么如何?我致力于使用Postgres,但看起来MySQL已对多个列实施了松散索引扫描: http://dev.mysql.com/doc/refman/5.1/en/group-by-optimization.html

The question is: can the recursion-based code be modified to work with more than one column, and if so, how? I'm committed to using Postgres, but it looks like MySQL has implemented loose index scans for more than one column: http://dev.mysql.com/doc/refman/5.1/en/group-by-optimization.html

根据建议,我附上我的EXPLAIN ANALYZE结果。

As recommended, I'm attaching my EXPLAIN ANALYZE results.

对于我的情况-我在使用GROUP BY 为2列选择不同的值的地方是:

For my situation - where I'm selecting distinct values for 2 columns using GROUP BY, it's the following:

 HashAggregate  (cost=1645408.44..1654099.65 rows=869121 width=34) (actual time=35411.889..36008.475 rows=1233080 loops=1)
   ->  Seq Scan on tablename  (cost=0.00..1535284.96 rows=22024696 width=34) (actual time=4413.311..25450.840 rows=22025768 loops=1)
 Total runtime: 36127.789 ms
(3 rows)

我不知道如何进行2列索引扫描(这是问题),但是出于目的相比之下,在一列上使用GROUP BY,我得到:

I don't know how to do a 2-column index scan (that's the question), but for purposes of comparison, using a GROUP BY on one column, I get:

 HashAggregate  (cost=1590346.70..1590347.69 rows=99 width=8) (actual time=32310.706..32310.722 rows=100 loops=1)
   ->  Seq Scan on tablename  (cost=0.00..1535284.96 rows=22024696 width=8) (actual time=4764.609..26941.832 rows=22025768 loops=1)
 Total runtime: 32350.899 ms
(3 rows)

但是对于一列的松散索引扫描,我得到:

 Result  (cost=181.28..198.07 rows=101 width=8) (actual time=0.069..1.935 rows=100 loops=1)
   CTE t
     ->  Recursive Union  (cost=1.74..181.28 rows=101 width=8) (actual time=0.062..1.855 rows=101 loops=1)
           ->  Result  (cost=1.74..1.75 rows=1 width=0) (actual time=0.061..0.061 rows=1 loops=1)
                 InitPlan 1 (returns $1)
                   ->  Limit  (cost=0.00..1.74 rows=1 width=8) (actual time=0.057..0.057 rows=1 loops=1)
                         ->  Index Only Scan using tablename_id on tablename  (cost=0.00..38379014.12 rows=22024696 width=8) (actual time=0.055..0.055 rows=1 loops=1)
                               Index Cond: (id IS NOT NULL)
                               Heap Fetches: 0
           ->  WorkTable Scan on t  (cost=0.00..17.75 rows=10 width=8) (actual time=0.017..0.017 rows=1 loops=101)
                 Filter: (id IS NOT NULL)
                 Rows Removed by Filter: 0
                 SubPlan 3
                   ->  Result  (cost=1.75..1.76 rows=1 width=0) (actual time=0.016..0.016 rows=1 loops=100)
                         InitPlan 2 (returns $3)
                           ->  Limit  (cost=0.00..1.75 rows=1 width=8) (actual time=0.016..0.016 rows=1 loops=100)
                                 ->  Index Only Scan using tablename_id on tablename  (cost=0.00..12811462.41 rows=7341565 width=8) (actual time=0.015..0.015 rows=1 loops=100)
                                       Index Cond: ((id IS NOT NULL) AND (id > t.id))
                                       Heap Fetches: 0
   ->  Append  (cost=0.00..16.79 rows=101 width=8) (actual time=0.067..1.918 rows=100 loops=1)
         ->  CTE Scan on t  (cost=0.00..2.02 rows=100 width=8) (actual time=0.067..1.899 rows=100 loops=1)
               Filter: (id IS NOT NULL)
               Rows Removed by Filter: 1
         ->  Result  (cost=13.75..13.76 rows=1 width=0) (actual time=0.002..0.002 rows=0 loops=1)
               One-Time Filter: $5
               InitPlan 5 (returns $5)
                 ->  Index Only Scan using tablename_id on tablename  (cost=0.00..13.75 rows=1 width=0) (actual time=0.002..0.002 rows=0 loops=1)
                       Index Cond: (id IS NULL)
                       Heap Fetches: 0
 Total runtime: 2.040 ms

完整表定义如下所示:

CREATE TABLE tablename
(
  source character(25),
  id bigint NOT NULL,
  time_ timestamp without time zone,
  height numeric,
  lon numeric,
  lat numeric,
  distance numeric,
  status character(3),
  geom geometry(PointZ,4326),
  relid bigint
)
WITH (
  OIDS=FALSE
);

CREATE INDEX tablename_height
  ON public.tablename
  USING btree
  (height);

CREATE INDEX tablename_geom
  ON public.tablename
  USING gist
  (geom);


CREATE INDEX tablename_id
  ON public.tablename
  USING btree
  (id);

CREATE INDEX tablename_lat
  ON public.tablename
  USING btree
  (lat);

CREATE INDEX tablename_lon
  ON public.tablename
  USING btree
  (lon);

CREATE INDEX tablename_relid
  ON public.tablename
  USING btree
  (relid);

CREATE INDEX tablename_sid
  ON public.tablename
  USING btree
  (source COLLATE pg_catalog."default", id);

CREATE INDEX tablename_source
  ON public.tablename
  USING btree
  (source COLLATE pg_catalog."default");

CREATE INDEX tablename_time
  ON public.tablename
  USING btree
  (time_);






答案选择:

我花了一些时间比较提供的方法。我有时希望这样的答案可以被接受,但在这种情况下,我要给@jjanes打勾。原因是他的解决方案与最初提出的问题更接近,而且我对所需的WHERE语句的形式也有了一些见识。最后,对于我来说,HashAggregate实际上是最快的方法,但这是由于我的数据的性质,而不是算法有任何问题。我为以下不同方法附加了EXPLAIN ANALYZE,并将为janes和joop都赋予+1。

I took some time in comparing the approaches that were provided. It's at times like this that I wish that more than one answer could be accepted, but in this case, I'm giving the tick to @jjanes. The reason for this is that his solution matches the question as originally posed more closely, and I was able to get some insights as to the form of the required WHERE statement. In the end, the HashAggregate is actually the fastest approach (for me), but that's due to the nature of my data, not any problems with the algorithms. I've attached the EXPLAIN ANALYZE for the different approaches below, and will be giving +1 to both jjanes and joop.

HashAggregate:

 HashAggregate  (cost=1018669.72..1029722.08 rows=1105236 width=34) (actual time=24164.735..24686.394 rows=1233080 loops=1)
   ->  Seq Scan on tablename  (cost=0.00..908548.48 rows=22024248 width=34) (actual time=0.054..14639.931 rows=22024982 loops=1)
 Total runtime: 24787.292 ms

宽松的索引扫描修改

CTE Scan on t  (cost=13.84..15.86 rows=100 width=112) (actual time=0.916..250311.164 rows=1233080 loops=1)
   Filter: (source IS NOT NULL)
   Rows Removed by Filter: 1
   CTE t
     ->  Recursive Union  (cost=0.00..13.84 rows=101 width=112) (actual time=0.911..249295.872 rows=1233081 loops=1)
           ->  Limit  (cost=0.00..0.04 rows=1 width=34) (actual time=0.910..0.911 rows=1 loops=1)
                 ->  Index Only Scan using tablename_sid on tablename  (cost=0.00..965442.32 rows=22024248 width=34) (actual time=0.908..0.908 rows=1 loops=1)
                       Heap Fetches: 0
           ->  WorkTable Scan on t  (cost=0.00..1.18 rows=10 width=112) (actual time=0.201..0.201 rows=1 loops=1233081)
                 Filter: (source IS NOT NULL)
                 Rows Removed by Filter: 0
                 SubPlan 1
                   ->  Limit  (cost=0.00..0.05 rows=1 width=34) (actual time=0.100..0.100 rows=1 loops=1233080)
                         ->  Index Only Scan using tablename_sid on tablename  (cost=0.00..340173.38 rows=7341416 width=34) (actual time=0.100..0.100 rows=1 loops=1233080)
                               Index Cond: (ROW(source, id) > ROW(t.source, t.id))
                               Heap Fetches: 0
                 SubPlan 2
                   ->  Limit  (cost=0.00..0.05 rows=1 width=34) (actual time=0.099..0.099 rows=1 loops=1233080)
                         ->  Index Only Scan using tablename_sid on tablename  (cost=0.00..340173.38 rows=7341416 width=34) (actual time=0.098..0.098 rows=1 loops=1233080)
                               Index Cond: (ROW(source, id) > ROW(t.source, t.id))
                               Heap Fetches: 0
 Total runtime: 250491.559 ms

合并反加入

Merge Anti Join  (cost=0.00..12099015.26 rows=14682832 width=42) (actual time=48.710..541624.677 rows=1233080 loops=1)
   Merge Cond: ((src.source = nx.source) AND (src.id = nx.id))
   Join Filter: (nx.time_ > src.time_)
   Rows Removed by Join Filter: 363464177
   ->  Index Only Scan using tablename_pkey on tablename src  (cost=0.00..1060195.27 rows=22024248 width=42) (actual time=48.566..5064.551 rows=22024982 loops=1)
         Heap Fetches: 0
   ->  Materialize  (cost=0.00..1115255.89 rows=22024248 width=42) (actual time=0.011..40551.997 rows=363464177 loops=1)
         ->  Index Only Scan using tablename_pkey on tablename nx  (cost=0.00..1060195.27 rows=22024248 width=42) (actual time=0.008..8258.890 rows=22024982 loops=1)
               Heap Fetches: 0
 Total runtime: 541750.026 ms


推荐答案

有点丑陋,但这似乎工作:

Rather hideous, but this seems to work:

WITH RECURSIVE
     t AS (
  select a,b from (select a,b from foo order by a,b limit 1) asdf union all 
  select (select a from foo where (a,b) > (t.a,t.b) order by a,b limit 1),
         (select b from foo where (a,b) > (t.a,t.b) order by a,b limit 1) 
     from t where t.a is not null)  
select * from t where t.a is not null;

我真的不明白为什么需要不为空,而在哪里排在第一位?

I don't really understand why the "is not nulls" are needed, as where do the nulls come from in the first place?

这篇关于Postgres中多个字段的索引扫描松散?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆