如何递归查找两个表之间的相交地理 [英] How to find intersecting geographies between two tables recursively
问题描述
我正在运行Postgres 9.6.1和PostGIS 2.3.0 r15146,并且有两个表.地理位置
可能有150,000,000行,路径
可能有10,000,000行:
CREATE TABLE路径(id uuid NOT NULL,路径路径NOT NULL,PRIMARY KEY(id))CREATE TABLE地域(id uuid非空,地理地理非空,PRIMARY KEY(id))
给出表 geographyies
的 ids
的数组/集合,查找所有相交路径和几何的最佳"方法是什么?
换句话说,如果初始的地理位置
具有相应的相交的路径
,我们还需要查找所有 other 地理位置此
path
相交的code>.从那里,我们需要找到这些新发现的 geographyies
相交的所有其他 paths
,依此类推,直到找到所有可能的交点为止.
最初的地理ID(我们的输入)可能在0到700之间.平均约为40.
最小交集为0,最大交集为1000.平均交集可能约为20,通常少于100个.
我已经创建了一个执行此操作的功能,但是我对PostGIS和一般的Postgres中的GIS还是陌生的.我已经发布了我的解决方案,作为对这个问题.
我觉得应该比我自己想出的方法更加雄辩,更快.
您的函数可以是 彻底 简化.
设置
我建议您将列 paths.path
转换为数据类型 geography
(或至少是 geometry
). 路径
是一个原生Postgres类型,不能与PostGIS功能和空间索引配合使用.您将必须投射 path :: geometry
或 path :: geometry :: geography
( ST_Intersects()
.
我的答案基于这些经过修改的表格:
CREATE TABLE路径(id uuid主键,路径地理位置 NOT NULL);创建表格地理位置(id uuid主键,地理地理NOT NULL,fk_id文字NOT NULL);
两列的所有数据类型均为 geometry
.地理
通常更精确,但也更昂贵.使用哪个?在此处阅读PostGIS常见问题解答.
解决方案1:您的功能已优化
创建或替换功能public.function_name(_fk_ids text [])返回表(id uuid,键入文本)语言plpgsql AS$ func $宣布_row_ct int;_loop_ct int:= 0;开始在COMMIT DROP AS上创建TEMP TABLE _geo-在事务结束时删除SELECT DISTINCT ON(g.id)g.id,g.geography,_loop_ct AS loop_ct-可能会重复吗?来自地理位置g在哪里g.fk_id = ANY(_fk_ids);GET DIAGNOSTICS _row_ct = ROW_COUNT;如果_row_ct = 0 THEN-找不到行,立即返回空结果返回;-退出功能万一;在COMMIT DROP AS上创建TEMP TABLE _pathSELECT DISTINCT ON(p.id)p.id,p.path,_loop_ct AS loop_ct来自_geo gJOIN路径p ON ST_Intersects(g.geography,p.path);-还没有骗子GET DIAGNOSTICS _row_ct = ROW_COUNT;如果_row_ct = 0然后-找不到行,请立即返回_geo返回查询选择g.id,来自_geo g的文本"geo";返回;万一;ALTER TABLE _geo ADD CONSTRAINT g_uni UNIQUE(id);-UPSERT必需ALTER TABLE _path ADD CONSTRAINT p_uni UNIQUE(id);环形_loop_ct:= _loop_ct + 1;插入_geo(id,geography,loop_ct)SELECT DISTINCT ON(g.id)g.id,g.geography,_loop_ctFROM _paths p在ST_Intersects上加入地理(g.geography,p.path)其中p.loop_ct = _loop_ct-1-仅使用最后一轮!冲突时约束g_uni不需要;-消除新的骗子找不到时退出;插入到_path(id,path,loop_ct)选择DISTINCT ON(p.id)p.id,p.path,_loop_ct来自_geo gJOIN路径p ON ST_Intersects(g.geography,p.path)在哪里g.loop_ct = _loop_ct-1关于冲突p_uni禁止;禁止;禁止找不到时退出;结束循环;返回查询选择g.id,从_geo g输入文本"geo"全联盟SELECT p.id,从"_path p"输入文本"path";结尾$ func $;
致电:
SELECT * FROM public.function_name('{foo,bar}');
比您所拥有的要快得多.
要点
-
您基于整个集合查询,而不是仅基于集合的最新查询.每个循环都变得越来越慢,不需要.我添加了一个循环计数器(
loop_ct
)来避免重复工作. -
确保在
geographies.geography
和paths.path
上具有空间GiST 索引:使用GIST在地理上创建索引geo_geo_gix;CREATE INDEX path_path_gix使用GIST开启路径(路径);
自Postgres 9.5起 仅索引扫描将是GiST索引的一种选择.您可以添加 id
作为第二个索引列.好处取决于许多因素,您必须进行测试.但是,没有用于 uuid
类型的拟合运算符GiST类.安装扩展 btree_gist :
-
在
g.fk_id
上也具有合适的索引.同样,如果您可以从中获取仅索引的扫描,则(fk_id,id,地理位置)
上的多列索引可能会有所帮助.默认btree索引fk_id
必须是第一索引列.尤其是如果您经常运行查询并且很少更新表,并且表行比索引宽得多. -
您可以在声明时初始化变量.重写后只需要一次.
-
ON COMMIT DROP
会在事务结束时自动删除临时表.因此,我明确删除了删除表.但是,如果在 same 事务中两次调用该函数,则会出现异常.在此函数中,我将检查临时表是否存在,并在这种情况下使用TRUNCATE
.相关: -
使用 <代码>获取诊断 即可获取行计数,而不是对计数进行另一个查询.
您需要获取诊断
. CREATE TABLE
没有设置 FOUND
(如手册中所述).
-
在填充表之后,添加索引或PK/UNIQUE约束 更快.而不是在我们真正需要它之前.
从Postgres 9.5开始, -
ON CONFLICT ... DO ...
是UPSERT的更简单,更便宜的方法.
对于命令的简单形式,您只需列出索引列或表达式(例如 ON CONFLICT(id)DO ...
),然后让Postgres执行唯一索引推断以确定仲裁约束或索引.后来,我通过直接提供约束进行了优化.但是为此,我们需要一个实际的约束-唯一的索引是不够的.相应地修复.此处的手册中的详细信息. >
-
可能有助于手动
ANALYZE
临时表,以帮助Postgres找到最佳的查询计划.(但我认为您不需要这种情况.) -
_geo_ct-_geographyLength>0
是表示_geo_ct>_geographyLength
.但这已经完全消失了. -
不引用语言名称.只需
LANGUAGE plpgsql
. -
您的函数参数为
fk_id
的数组的varchar []
,但是您后来发表了评论:
这是一个
bigint
字段,代表一个地理区域(它实际上是15级的预计算s2cell
id).
我不知道级别15的 s2cell
id ,但理想情况下,您传递的是匹配数据类型的数组,或者如果不是默认选项,则为文字[]
.
也因为您评论了
总是有13个
fk_id
传入.
对于 VARIADIC
函数参数,这似乎是一个完美的用例.因此,您的函数定义为:
创建或替换功能public.function_name(_fk_ids VARIADIC text [])...
详细信息:
解决方案2:具有递归CTE的纯SQL
很难将 rCTE 环绕起来,但可能需要一些SQL技巧:
使用RECURSIVE cte AS(SELECT g.id,g.geography :: text,NULL :: text AS路径,文本'geo'AS类型来自地理位置gg.fk_id = ANY($ kf_ids)-您的输入数组在这里联盟选择p.id,g.geography :: text,p.path :: text,p.path为NULL时的情况,然后'geo'ELSE'path'END AS类型从cte c左联接路径p ON c.type ='geo'AND ST_Intersects(c.geography :: geography,p.path)左联接地理g ON c.type ='path'AND ST_Intersects(g.geography,c.path :: geography)在哪里(p.path不为空或g.geography不为空))SELECT id,输入FROM cte;
仅此而已.
您需要与上述相同的索引.您可以将其包装到SQL函数中以供重复使用.
其他主要要点
-
必须强制转换为
text
,因为geography
类型不是可散列"的.(与geometry
相同).(有关详细信息,请参见此公开的PostGIS问题.)通过强制转换为文本
.行仅凭借(id,type)
是唯一的,为此我们可以忽略geography
列.投射回地理位置
进行连接.不应花太多的钱. -
我们需要两个
LEFT JOIN
,以便不排除行,因为在每次迭代中,两个表中只有一个可能会贡献更多的行.
最终条件确保我们还没有完成:在哪里(p.path不为空或g.geography不为空)
之所以可行,是因为从临时项目中排除重复的发现中间表.手册:
对于
UNION
(但不是UNION ALL
),请丢弃重复的行和复制任何先前的结果行.将所有剩余的行包括在递归查询的结果,并将它们放在一个临时目录中中间表.
那哪个更快?
rCTE可能比小结果集的功能要快.函数中的临时表和索引意味着更多的开销.但是,对于较大的结果集,功能 可能会更快.只有使用您的实际设置进行测试,才能为您提供确定的答案.*
- 请参见 OP的评论中的反馈.
I'm running Postgres 9.6.1 and PostGIS 2.3.0 r15146 and have two tables.
geographies
may have 150,000,000 rows, paths
may have 10,000,000 rows:
CREATE TABLE paths (id uuid NOT NULL, path path NOT NULL, PRIMARY KEY (id))
CREATE TABLE geographies (id uuid NOT NULL, geography geography NOT NULL, PRIMARY KEY (id))
Given an array/set of ids
for table geographies
, what is the "best" way of finding all intersecting paths and geometries?
In other words, if an initial geography
has a corresponding intersecting path
we need to also find all other geographies
that this path
intersects. From there, we need to find all other paths
that these newly found geographies
intersect, and so on until we've found all possible intersections.
The initial geography ids (our input) may be anywhere from 0 to 700. With an average around 40.
Minimum intersections will be 0, max will be about 1000. Average likely around 20, typically less than 100 connected.
I've created a function that does this, but I'm new to GIS in PostGIS, and Postgres in general. I've posted my solution as an answer to this question.
I feel like there should be a more eloquent and faster way of doing this than what I've come up with.
Your function can be radically simplified.
Setup
I suggest you convert the column paths.path
to data type geography
(or at least geometry
). path
is a native Postgres type and does not play well with PostGIS functions and spatial indexes. You would have to cast path::geometry
or path::geometry::geography
(resulting in a LINESTRING
internally) to make it work with PostGIS functions like ST_Intersects()
.
My answer is based on these adapted tables:
CREATE TABLE paths (
id uuid PRIMARY KEY
, path geography NOT NULL
);
CREATE TABLE geographies (
id uuid PRIMARY KEY
, geography geography NOT NULL
, fk_id text NOT NULL
);
Everything works with data type geometry
for both columns just as well. geography
is generally more exact but also more expensive. Which to use? Read the PostGIS FAQ here.
Solution 1: Your function optimized
CREATE OR REPLACE FUNCTION public.function_name(_fk_ids text[])
RETURNS TABLE(id uuid, type text)
LANGUAGE plpgsql AS
$func$
DECLARE
_row_ct int;
_loop_ct int := 0;
BEGIN
CREATE TEMP TABLE _geo ON COMMIT DROP AS -- dropped at end of transaction
SELECT DISTINCT ON (g.id) g.id, g.geography, _loop_ct AS loop_ct -- dupes possible?
FROM geographies g
WHERE g.fk_id = ANY(_fk_ids);
GET DIAGNOSTICS _row_ct = ROW_COUNT;
IF _row_ct = 0 THEN -- no rows found, return empty result immediately
RETURN; -- exit function
END IF;
CREATE TEMP TABLE _path ON COMMIT DROP AS
SELECT DISTINCT ON (p.id) p.id, p.path, _loop_ct AS loop_ct
FROM _geo g
JOIN paths p ON ST_Intersects(g.geography, p.path); -- no dupes yet
GET DIAGNOSTICS _row_ct = ROW_COUNT;
IF _row_ct = 0 THEN -- no rows found, return _geo immediately
RETURN QUERY SELECT g.id, text 'geo' FROM _geo g;
RETURN;
END IF;
ALTER TABLE _geo ADD CONSTRAINT g_uni UNIQUE (id); -- required for UPSERT
ALTER TABLE _path ADD CONSTRAINT p_uni UNIQUE (id);
LOOP
_loop_ct := _loop_ct + 1;
INSERT INTO _geo(id, geography, loop_ct)
SELECT DISTINCT ON (g.id) g.id, g.geography, _loop_ct
FROM _paths p
JOIN geographies g ON ST_Intersects(g.geography, p.path)
WHERE p.loop_ct = _loop_ct - 1 -- only use last round!
ON CONFLICT ON CONSTRAINT g_uni DO NOTHING; -- eliminate new dupes
EXIT WHEN NOT FOUND;
INSERT INTO _path(id, path, loop_ct)
SELECT DISTINCT ON (p.id) p.id, p.path, _loop_ct
FROM _geo g
JOIN paths p ON ST_Intersects(g.geography, p.path)
WHERE g.loop_ct = _loop_ct - 1
ON CONFLICT ON CONSTRAINT p_uni DO NOTHING;
EXIT WHEN NOT FOUND;
END LOOP;
RETURN QUERY
SELECT g.id, text 'geo' FROM _geo g
UNION ALL
SELECT p.id, text 'path' FROM _path p;
END
$func$;
Call:
SELECT * FROM public.function_name('{foo,bar}');
Much faster than what you have.
Major points
You based queries on the whole set, instead of the latest additions to the set only. This gets increasingly slower with every loop without need. I added a loop counter (
loop_ct
) to avoid redundant work.Be sure to have spatial GiST indexes on
geographies.geography
andpaths.path
:CREATE INDEX geo_geo_gix ON geographies USING GIST (geography); CREATE INDEX paths_path_gix ON paths USING GIST (path);
Since Postgres 9.5 index-only scans would be an option for GiST indexes. You might add id
as second index column. The benefit depends on many factors, you'd have to test. However, there is no fitting operator GiST class for the uuid
type. It would work with bigint
after installing the extension btree_gist:
Have a fitting index on
g.fk_id
, too. Again, a multicolumn index on(fk_id, id, geography)
might pay if you can get index-only scans out of it. Default btree index,fk_id
must be first index column. Especially if you run the query often and rarely update the table and table rows are much wider than the index.You can initialize variables at declaration time. Only needed once after the rewrite.
ON COMMIT DROP
drops the temp tables at the end of the transaction automatically. So I removed dropping tables explicitly. But you get an exception if you call the function in the same transaction twice. In the function I would check for existence of the temp table and useTRUNCATE
in this case. Related:Use
GET DIAGNOSTICS
to get the row count instead of running another query for the count.
You need GET DIAGNOSTICS
. CREATE TABLE
does not set FOUND
(as is mentioned in the manual).
It's faster to add an index or PK / UNIQUE constraint after filling the table. And not before we actually need it.
ON CONFLICT ... DO ...
is the simpler and cheaper way for UPSERT since Postgres 9.5.How to UPSERT (MERGE, INSERT ... ON DUPLICATE UPDATE) in PostgreSQL?
For the simple form of the command you just list index columns or expressions (like ON CONFLICT (id) DO ...
) and let Postgres perform unique index inference to determine an arbiter constraint or index. I later optimized by providing the constraint directly. But for this we need an actual constraint - a unique index is not enough. Fixed accordingly. Details in the manual here.
It may help to
ANALYZE
temporary tables manually to help Postgres find the best query plan. (But I don't think you need it in your case.)_geo_ct - _geographyLength > 0
is an awkward and more expensive way of saying_geo_ct > _geographyLength
. But that's gone completely now.Don't quote the language name. Just
LANGUAGE plpgsql
.Your function parameter is
varchar[]
for an array offk_id
, but you later commented:
It is a
bigint
field that represents a geographic area (it's actually a precomputeds2cell
id at level 15).
I don't know s2cell
id at level 15, but ideally you pass an array of matching data type, or if that's not an option default to text[]
.
Also since you commented:
There are always exactly 13
fk_id
s passed in.
This seems like a perfect use case for a VARIADIC
function parameter. So your function definition would be:
CREATE OR REPLACE FUNCTION public.function_name(_fk_ids VARIADIC text[]) ...
Details:
Solution 2: Plain SQL with recursive CTE
It's hard to wrap an rCTE around two alternating loops, but possible with some SQL finesse:
WITH RECURSIVE cte AS (
SELECT g.id, g.geography::text, NULL::text AS path, text 'geo' AS type
FROM geographies g
WHERE g.fk_id = ANY($kf_ids) -- your input array here
UNION
SELECT p.id, g.geography::text, p.path::text
, CASE WHEN p.path IS NULL THEN 'geo' ELSE 'path' END AS type
FROM cte c
LEFT JOIN paths p ON c.type = 'geo'
AND ST_Intersects(c.geography::geography, p.path)
LEFT JOIN geographies g ON c.type = 'path'
AND ST_Intersects(g.geography, c.path::geography)
WHERE (p.path IS NOT NULL OR g.geography IS NOT NULL)
)
SELECT id, type FROM cte;
That's all.
You need the same indexes as above. You might wrap it into an SQL function for repeated use.
Major additional points
The cast to
text
is necessary because thegeography
type is not "hashable" (same forgeometry
). (See this open PostGIS issue for details.) Work around it by casting totext
. Rows are unique by virtue of(id, type)
alone, we can ignore thegeography
columns for this. Cast back togeography
for the join. Shouldn't cost too much extra.We need two
LEFT JOIN
so not to exclude rows, because at each iteration only one of the two tables may contribute more rows.
The final condition makes sure we are not done, yet:WHERE (p.path IS NOT NULL OR g.geography IS NOT NULL)
This works because duplicate findings are excluded from the temporary intermediate table. The manual:
For
UNION
(but notUNION ALL
), discard duplicate rows and rows that duplicate any previous result row. Include all remaining rows in the result of the recursive query, and also place them in a temporary intermediate table.
So which is faster?
The rCTE is probably faster than the function for small result sets. The temp tables and indexes in the function mean considerably more overhead. For large result sets the function may be faster, though. Only testing with your actual setup can give you a definitive answer.*
这篇关于如何递归查找两个表之间的相交地理的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!