LATERAL JOIN不使用trigram索引 [英] LATERAL JOIN not using trigram index

查看:151
本文介绍了LATERAL JOIN不使用trigram索引的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想使用Postgres对地址进行一些基本的地理编码。我有一个大约有100万个原始地址字符串的地址表:

I want to do some basic geocoding of addresses using Postgres. I have an address table that has around 1 million raw address strings:

=> \d addresses
  Table "public.addresses"
 Column  | Type | Modifiers
---------+------+-----------
 address | text |

我还有一张位置数据表:

I also have a table of location data:

=> \d locations
   Table "public.locations"
   Column   | Type | Modifiers
------------+------+-----------
 id         | text |
 country    | text |
 postalcode | text |
 latitude   | text |
 longitude  | text |

大多数地址字符串包含邮政编码,所以我的第一次尝试是做类似的和横向连接:

Most of the address strings contain postalcodes, so my first attempt was to do a like and a lateral join:

EXPLAIN SELECT * FROM addresses a
JOIN LATERAL (
    SELECT * FROM locations
    WHERE address ilike '%' || postalcode || '%'
    ORDER BY LENGTH(postalcode) DESC
    LIMIT 1
) AS l ON true;

这给出了预期的结果,但速度很慢。这是查询计划:

That gave the expected result, but it was slow. Here's the query plan:

                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Nested Loop  (cost=18383.07..18540688323.77 rows=1008572 width=91)
   ->  Seq Scan on addresses a  (cost=0.00..20997.72 rows=1008572 width=56)
   ->  Limit  (cost=18383.07..18383.07 rows=1 width=35)
         ->  Sort  (cost=18383.07..18391.93 rows=3547 width=35)
               Sort Key: (length(locations.postalcode))
               ->  Seq Scan on locations  (cost=0.00..18365.33 rows=3547 width=35)
                     Filter: (a.address ~~* (('%'::text || postalcode) || '%'::text))

我尝试将gist trigram索引添加到地址列,如 https://stackoverflow.com/a/13452528/36191 ,但上述查询的查询计划未使用它,并且查询计划保持不变。

I tried adding a gist trigram index to the address column, like mentioned at https://stackoverflow.com/a/13452528/36191, but the query plan for the above query doesn't make use of it, and the query plan in unchanged.

CREATE INDEX idx_address ON addresses USING gin (address gin_trgm_ops);

我必须删除横向连接查询的顺序和限制才能使用索引,哪个不给我想要的结果。这是没有 ORDER LIMIT 的查询的查询计划:

I have to remove the order by and limit in the lateral join query for the index to get used, which doesn't give me the results I want. Here's the query plan for the query without ORDER or LIMIT:

                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Nested Loop  (cost=39.35..129156073.06 rows=3577682241 width=86)
   ->  Seq Scan on locations  (cost=0.00..12498.55 rows=709455 width=28)
   ->  Bitmap Heap Scan on addresses a  (cost=39.35..131.60 rows=5043 width=58)
         Recheck Cond: (address ~~* (('%'::text || locations.postalcode) || '%'::text))
         ->  Bitmap Index Scan on idx_address  (cost=0.00..38.09 rows=5043 width=0)
               Index Cond: (address ~~* (('%'::text || locations.postalcode) || '%'::text))

我能做些什么来让查询使用索引,或者有没有更好的方法来重写这个查询?

Is there something I can do to get the query to use the index, or is there a better way to rewrite this query?

推荐答案

为什么?



查询不能使用主体索引。您需要表 locations 上的索引,但您所拥有的索引位于表地址

Why?

The query cannot use the index on principal. You would need an index on the table locations, but the one you have is on the table addresses.

您可以通过设置来验证我的声明:

You can verify my claim by setting:

SET enable_seqscan = off;

(仅在您的会话中,仅用于调试。切勿在生产中使用它。)它不喜欢索引比顺序扫描更昂贵,Postgres根本没办法将它用于你的查询

(In your session only, and for debugging only. Never use it in production.) It's not like the index would be more expensive than a sequential scan, there is just no way for Postgres to use it for your query at all.

除了: [INNER] JOIN ... ON true 只是一种尴尬的说法 CROSS JOIN ...

Aside: [INNER] JOIN ... ON true is just an awkward way of saying CROSS JOIN ...

因为Postgres可以将这个简单的表格重写为:

Because Postgres can rewrite this simple form to:

SELECT *
FROM   addresses a
JOIN   locations l ON a.address ILIKE '%' || l.postalcode || '%';

您将看到完全相同的查询计划。 (至少我在Postgres 9.5的测试中做了。)

You'll see the exact same query plan. (At least I do in my tests on Postgres 9.5.)

你需要一个索引 locations.postalcode 。在使用 LIKE ILIKE 时,您还需要带上索引表达式( postalcode )到运算符的 left 侧。运算符 ~~ * 实现 ILIKE ,且此运算符没有 COMMUTATOR (逻辑必然性),因此无法翻转操作数。这些相关答案中的详细解释:

You need an index on locations.postalcode. And while using LIKE or ILIKE you would also need to bring the indexed expression (postalcode) to the left side of the operator. ILIKE is implemented with the operator ~~* and this operator has no COMMUTATOR (a logical necessity), so it's not possible to flip operands around. Detailed explanation in these related answers:

  • Can PostgreSQL index array columns?
  • PostgreSQL - text Array contains value similar to
  • Is there a way to usefully index a text column containing regex patterns?

解决方案是使用三元组相似度运算符 或其反转,最近邻居查询中的eferrer>距离运算符 < - > (每个都是换向器本身,所以操作数可以自由切换位置):

A solution is to use the trigram similarity operator % or its inverse, the distance operator <-> in a nearest neighbour query instead (each is commutator for itself, so operands can switch places freely):

SELECT *
FROM   addresses a
JOIN   LATERAL (
   SELECT *
   FROM   locations
   ORDER  BY postalcode <-> a.address
   LIMIT  1
   ) l ON address ILIKE '%' || postalcode || '%';

查找最相似的邮政编码对于每个地址,然后检查邮政编码是否实际完全匹配。

Find the most similar postalcode for each address, and then check if that postalcode actually matches fully.

这样,较长的邮政编码将自动首选,因为它比较短的邮政编码更相似(更小的距离) 也匹配。

This way, a longer postalcode will be preferred automatically since it's more similar (smaller distance) than a shorter postalcode that also matches.

还有一些不确定性。根据可能的邮政编码,由于字符串其他部分中的三元组匹配,可能会出现误报。问题中没有足够的信息可以说明更多信息。

A bit of uncertainty remains. Depending on possible postal codes, there could be false positives due to matching trigrams in other parts of the string. There is not enough information in the question to say more.

此处 [INNER] JOIN 而不是 CROSS JOIN 是有意义的,因为我们添加了一个实际的连接条件。

Here, [INNER] JOIN instead of CROSS JOIN makes sense, since we add an actual join condition.

手册:

这可以通过GiST索引非常有效地实现,但不能通过GIN索引实现。

This can be implemented quite efficiently by GiST indexes, but not by GIN indexes.

所以:

CREATE INDEX locations_postalcode_trgm_gist_idx ON locations
USING gist (postalcode gist_trgm_ops);

这篇关于LATERAL JOIN不使用trigram索引的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆