Query by coordinates takes too long - options to optimize?


Problem Description

I have a table where I store events (about 5M at the moment, but there will be more). Each event has two attributes that I care about for this query - location (latitude and longitude pair) and relevancy.

My goal is: For a given location bounds (SW / NE latitude/longitude pairs, so 4 floating point numbers) return the top 100 events by relevancy which fall within those bounds.

I'm currently using the following query:

select * 
from event 
where latitude >= :swLatitude 
and latitude <= :neLatitude 
and longitude >= :swLongitude 
and longitude <= :neLongitude 
order by relevancy desc 
limit 100

Let's put aside for the moment the issue of date-line wrap-around which this query doesn't handle.

This works fine for smaller location bounds, but lags rather badly whenever I try to use larger location bounds.

I've defined the following index:

CREATE INDEX latitude_longitude_relevancy_index
  ON event
  USING btree
  (latitude, longitude, relevancy);

The table itself is quite straightforward:

CREATE TABLE event
(
  id uuid NOT NULL,
  relevancy double precision NOT NULL,
  data text,
  latitude double precision NOT NULL,
  longitude double precision NOT NULL,
  CONSTRAINT event_pkey PRIMARY KEY (id)
);

I tried explain analyze and got the following, which I think means the index isn't even used:

"Limit  (cost=1045499.02..1045499.27 rows=100 width=1249) (actual time=14842.560..14842.575 rows=100 loops=1)"
"  ->  Sort  (cost=1045499.02..1050710.90 rows=2084754 width=1249) (actual time=14842.557..14842.562 rows=100 loops=1)"
"        Sort Key: relevancy"
"        Sort Method: top-N heapsort  Memory: 351kB"
"        ->  Seq Scan on event  (cost=0.00..965821.22 rows=2084754 width=1249) (actual time=3090.660..12525.695 rows=1983213 loops=1)"
"              Filter: ((latitude >= 0::double precision) AND (latitude <= 180::double precision) AND (longitude >= 0::double precision) AND (longitude <= 180::double precision))"
"              Rows Removed by Filter: 3334584"
"Total runtime: 14866.532 ms"

I'm using PostgreSQL 9.3 on Win7 and it seems like overkill to move to anything else for this seemingly simple task.

Questions:

  • Any ways to use different indexes to help the current query be faster?
  • Any ways to rewrite the current query to be faster?
  • What's the easiest way to do this right? Install PostGIS and use the GEOGRAPHY data type? Is that actually going to provide a performance benefit over what I'm doing now? Which PostGIS function would work best for this query?

Edit #1: Results for vacuum full analyze:

INFO:  vacuuming "public.event"
INFO:  "event": found 0 removable, 5397347 nonremovable row versions in 872213 pages
DETAIL:  0 dead row versions cannot be removed yet.
CPU 17.73s/11.84u sec elapsed 154.24 sec.
INFO:  analyzing "public.event"
INFO:  "event": scanned 30000 of 872213 pages, containing 185640 live rows and 0 dead     rows; 30000 rows in sample, 5397344 estimated total rows
Total query runtime: 360092 ms.

Results after the vacuum:

"Limit  (cost=1058294.92..1058295.17 rows=100 width=1216) (actual time=6784.111..6784.121 rows=100 loops=1)"
"  ->  Sort  (cost=1058294.92..1063405.89 rows=2044388 width=1216) (actual time=6784.109..6784.113 rows=100 loops=1)"
"        Sort Key: relevancy"
"        Sort Method: top-N heapsort  Memory: 203kB"
"        ->  Seq Scan on event  (cost=0.00..980159.88 rows=2044388 width=1216) (actual time=0.043..6412.570 rows=1983213 loops=1)"
"              Filter: ((latitude >= 0::double precision) AND (latitude <= 180::double precision) AND (longitude >= 0::double precision) AND (longitude <= 180::double precision))"
"              Rows Removed by Filter: 3414134"
"Total runtime: 6784.170 ms"

Solution

You will be much better off using a spatial index, which uses an R-tree (essentially a two-dimensional index that operates by dividing the space into boxes). For this kind of query it will perform far better than greater-than/less-than comparisons on two separate lat and lon values. You will need to create a geometry column first, though, which you then index and use in your query in place of the separate lat/lon pairs you are currently using.

The following will create a geometry column, populate it from the existing lat/lon columns, and add an index to it, ensuring that it is a point in lon/lat coordinates (WGS 84, known as EPSG:4326):

-- add a geometry column, fill it from the existing longitude/latitude columns,
-- and index it with a GiST (R-tree) index
alter table event add column geom geometry(POINT, 4326);
update event set geom = ST_SetSRID(ST_MakePoint(longitude, latitude), 4326);
create index ix_spatial_event_geom on event using gist (geom);
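
Optionally, since geom is filled by a bulk update, it may also be worth refreshing the planner statistics afterwards (as you already did with vacuum full analyze), so the planner has good estimates for the new column and index:

-- refresh statistics so the planner has up-to-date information about geom
vacuum analyze event;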

Then you can run the following query to get your events; it uses a spatial intersection test, which should make use of your spatial index:

select *
from event
where ST_Intersects(ST_SetSRID(ST_MakeBox2D(ST_MakePoint(swLon, swLat),
                                            ST_MakePoint(neLon, neLat)), 4326), geom)
order by relevancy desc
limit 100;

You make the bounding box for the intersection with ST_MakeBox2D and two points on diagonally opposite corners of the box, so either the SW/NE pair or the NW/SE pair would work.

When you run explain on this, you should find that the spatial index is used. This will perform vastly better than two separate indexes on the lon and lat columns, as you are only hitting one index optimized for spatial search rather than two B-trees. I realize this represents another way of doing it and only answers your original question indirectly.
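
For example, a quick check (same placeholder parameters as above) is to prefix the query with explain analyze, as you did for the original query:

explain analyze
select *
from event
where ST_Intersects(ST_SetSRID(ST_MakeBox2D(ST_MakePoint(swLon, swLat),
                                            ST_MakePoint(neLon, neLat)), 4326), geom)
order by relevancy desc
limit 100;
-- the plan should reference the GiST index ix_spatial_event_geom rather than a Seq Scan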

EDIT: Mike T has made the very good point that for bounding-box searches in 4326, it is more appropriate and quicker to use the geometry data type and the && operator, as the SRID will be ignored anyway, e.g.:

 where ST_MakeBox2D(ST_MakePoint(swLon, swLat), ST_MakePoint(neLon, neLat)) && geom
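
A minimal sketch of the complete query with the && filter, assuming the same geom column and the same swLon/swLat/neLon/neLat placeholders as before:

-- bounding-box filter via the && operator instead of ST_Intersects
select *
from event
where ST_MakeBox2D(ST_MakePoint(swLon, swLat), ST_MakePoint(neLon, neLat)) && geom
order by relevancy desc
limit 100;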
