通过相似性Postgres模糊自连接查询提高性能 [英] Improving performance with a Similarity Postgres fuzzy self join query

查看:102
本文介绍了通过相似性Postgres模糊自连接查询提高性能的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试运行一个查询,它将一个表与自己相连,并进行模糊字符串比较(使用三元组比较)来查找可能的公司名称匹配。我的目标是返回一条记录的公司名称(ref_name字段)与其他记录的公司名称匹配的三角形相似度的记录。目前,我的阈值设置为0.9,所以它只会带来很可能包含一个类似的字符串的匹配。



我知道自联接可以导致许多比较性质,但我想优化我的查询最好的我可以。我不需要即时的结果,但目前运行的查询需要11个小时才能运行。



我在Ubuntu 12.04服务器上运行Postgres 9.2。我不知道ref_name字段(我匹配的字段)的最大长度是多少,所以我将它设置为一个 varchar(300)。我想知道如果将其设置为文本类型可能会影响性能,或者如果有更好的字段类型用于加快性能。我的 LC_CTYPE LC_COLLATE 区域设置设置为en_US.UTF-8



我正在运行查询的表总共有大约160万条记录,但是需要我11小时运行的查询是小的



表结构:

  CREATE TABLE ref_name(
ref_name_id integer,
ref_name character vary(300),
ref_name_type character vary(2),
name_display text,
load_date timestamp without time zone

索引:

  CREATE INDEX ref_name_ref_name_trigram_idx ON ref_name 
USING gist(ref_name COLLATE pg_catalog。defaultgist_trgm_ops);

CREATE INDEX ref_name_ref_name_trigram_idx_1 ON ref_name
USING gist(ref_name COLLATE pg_catalog。defaultgist_trgm_ops)
WHERE ref_name_type :: text ='E':: text;

CREATE INDEX ref_name_ref_name_e_idx ON ref_name
USING btree(ref_name COLLATE pg_catalog。default)
WHERE ref_name_type :: text ='E':: text;

查询:

 code>选择a.ref_name_id作为name_id,a.ref_name AS名称,
a.name_display AS name_display,b.ref_name_id AS matched_name_id,
b.ref_name AS matched_name,b.name_display AS matched_name_display
from ref_name a
JOIN ref_name b
ON a.ref_name_id" b.ref_name_id
AND a.ref_name_id> b.ref_name_id
AND a.ref_name%b。 ref_name
WHERE
a.ref_name〜> =〜'A'和a.ref_name〜<〜'B'
AND b.ref_name〜> =〜'A'和b .ref_name〜<〜'B'
AND a.ref_name_type ='E'
AND b.ref_name_type ='E'

解释计划:

 嵌套循环(cost = 0.00..8560728.16 rows = 3598470 width = 96)
- > Seq Scan on ref_name a(cost = 0.00..96556.12 rows = 103901 width = 48)
过滤器:(((ref_name):: text 〜('=''''文本)AND((ref_name):: text〜<〜'B':: text)AND((ref_name_type):: text ='E'
- >索引扫描在ref_name b上使用ref_name_ref_name_trigram_idx_1(cost = 0.00..80.41 rows = 35 width = 48)
索引Cond:((a.ref_name)::文本%(ref_name):: text)
过滤器:(((ref_name):: text〜> =〜'A':: text)AND((ref_name):: text〜 B':: text)AND(a.ref_name_id<> ref_name_id)AND(a.ref_name_id> ref_name_id))

以下是一些示例记录: A 123系统;E;A 123系统公司;2014-11-14 00:00 :00
1652633;A123 SYSTEMS;E;A123 SYSTEMS INC;2014-11-14 00:00:00
1652640;A 1 ACCOUSICCS;E A-1 ACCOUSTICS;2014-11-14 00:00:00
1652641;A 1 ACOUSTICS;E;A-1 ACOUSTICS;2014-11-14 00:00:00
1652642;A1 ACOUSTICS;E;A1 ACOUSTICS INC;2014-11-14 00:00:00
1652650;A 1 A ELECTRICAL ;E;A-1 A ELECTRIC INC;2014-11-14 00:00:00
1652651;A 1 A ELECTRICIAN;E;A 1 A ELECTRICIAN INC ;2014-11-14 00:00:00
1652652;A 1A ELECTRICIAN;E;A 1A ELECTRICIAN INC;2014-11-14 00:00:00
1652653;A1 A ELECTRICIAN;E;A1 A ELECTRICIAN INC;2014-11-14 00:00:00
1691270;ALBERT GARLATTI;E; ALBERT GARLATTI;2014-11-14 00:00:00
1691271;ALBERT GARLATTI CONSTRUCTION;E;ALBERT GARLATTI CONSTRUCTION CO;2014-11-14 00 :00:00
1680892;AG HOG PITTSBURGH;E;AG-HOG PITTSBURGH CO INC;2014-11-14 00:00:00
1680893;AGHOG PITTSBURGH;E;AGHOG PITTSBURGH CO;2014-11-14 00:00:00
1680928;AGILE PURSUITS FRACHISING;E;AGILE PURSUITS FRACHISING INC;2014 -11-14 00:00:00
1680929;AGILE PURSUITS FRANCHISING;E;AGILE PURSUITS FRANCHISING INC;2014-11-14 00:00:00
1680956 ;老龄化协调企业与支持;E;老龄化社区协调企业支持;2014-11-14 00:00:00
1680957;老龄化协调企业与SUPPORTI;E;老龄化社区协调企业 SUPPORTI;2014-11-14 00:00:00

如你所见,我创建了一个gist trigram指数来加快速度(尝试两种不同类型到目前为止比较)有没有人有任何建议,我如何可以提高这个查询的性能,并将其从11小时降低到更易于管理的最终我会喜欢在整个表上运行这个查询来比较记录,而不仅仅是这个小的子集。

解决方案

索引



部分GiST索引很好,我至少要测试这两个附加的索引:



一个GIN索引:

  CREATE INDEX ref_name_trgm_gin_idx ON ref_name 
USING gin(ref_name gin_trgm_ops)
WHERE ref_name_type ='E';

这可能是也可能不会被使用,如果升级到Postgres 9.4,机会会更好,因为已经有了很大的改进到GIN索引。



一个varchar_pattern_ops我ndex:

  CREATE INDEX ref_name_pattern_ops_idx 
ON ref_name(ref_name varchar_pattern_ops)
WHERE ref_name_type ='E';



查询



心中的问题的这个查询,当您检查所有行的所有行时,您正在运行到具有 O(N²)的交叉连接。性能变得无法忍受,有很多行。你似乎很清楚动态。防御是限制可能的组合。你已经朝着这个方向迈出了一步,限制了同样的第一个字母。



这里有一个非常好的选择是建立在一个特殊的天赋下, strong> for 最近邻搜索。该查询技术有一个手册中的提示: p>


这可以通过GiST索引非常有效地实现,但不能由
GIN索引实现。当只需要一个
的最接近的比赛数量时,通常会打败第一个表达式。


A GIN索引仍然可以使用另外到GiST索引。你必须权衡成本和收益。总体来说,在9.4之前的版本中可以使用一个较大的索引。但是它可能在第9.4节中是值得的。



Postgres 9.2



使用相关子查询替换尚未存在的 LATERAL join:

  SELECT a。* 
,b.ref_name AS match_name
,b.name_display AS match_name_display
FROM(
SELECT ref_name_id
,ref_name
,name_display
,(SELECT ref_name_id AS match_name_id
FROM ref_name b
WHERE ref_name_type ='E'
AND ref_name ~~'A%'
AND ref_name_id> a.ref_name_id
AND ref_name%a.ref_name
ORDER BY ref_name< - > a.ref_name
LIMIT 1 - 最多1最匹配

FROM ref_name a
WHERE ref_name ~~'A%'
AND ref_name_type ='E'
)a
JOIN ref_name b ON b.ref_n ame_id = a.match_name_id
ORDER BY 1;

显然,这也需要在 ref_name_id ,通常应该是PK,因此自动编入索引。



我在 /sqlfiddle.com/#!15/5d4be/33rel =nofollow noreferrer> SQL Fiddle



Postgres 9.3 +



使用 LATERAL 连接进行匹配设置。与此相关答案中的 2a 类似:







  SELECT a.ref_name_id 
,a.ref_name
,a.name_display
,b.ref_name_id AS match_name_id
,b.ref_name AS match_name
,b.name_display AS match_name_display
FROM ref_name a
,LATERAL(
SELECT b.ref_name_id,b.ref_name,b.name_display
FROM ref_name b
WHERE b.ref_name ~~'A%'
AND b.ref_name_type ='E'
AND a.ref_name_id< b.ref_name_id
AND a.ref_name%b.ref_name - 还强制最小相似度
ORDER BY a.ref_name< ; - > b.ref_name
LIMIT 10 - 最多10最匹配
)b
WHERE a.ref_name ~~'A%' - 您可以扩展搜索
AND a.ref_name_type ='E'
ORDER BY 1;

SQL Fiddle ,所有变体与您的案例之后建模的40k行上的原始查询相比较。



查询是您的原始的小提琴2 - 5倍。我希望他们能够更好地扩展数百万行。您必须测试。



b 中的匹配搜索扩展到所有行(同时限制候选人 a 到合理的数量)也是相当便宜的。我添加了另外两个变体到小提琴。



Aside:我运行所有测试与文本而不是 varchar ,但这不应该有所作为。



基础和链接:




I am trying to run a query that joins a table against itself and does fuzzy string comparison (using trigram comparisons) to find possible company name matches. My goal is to return records where the trigram similarity of one record's company name (ref_name field) matches another record's company name. Currently, I have my threshold set to 0.9 so it will only bring back matches that are very likely to contain the a similar string.

I know that self joins can result in many comparisons by nature, but I want to optimize my query the best I can. I don't need results instantaneously, but currently the query I am running takes 11 hours to run.

I am running Postgres 9.2 on a Ubuntu 12.04 server. I don't know what the max length of the ref_name field (field I'm matching on) is, so I set it to a varchar(300). I wonder if setting it to a text type may affect performance at all or if there is a better field type to use to speed up performance. My LC_CTYPE and LC_COLLATE locales are set to "en_US.UTF-8"

The table I am running the query on consists of about 1.6 million records in total, but the query that takes me 11 hours to run is on a small subset of that (about 100k).

Table Structure:

CREATE TABLE ref_name (
  ref_name_id integer,
  ref_name character varying(300),
  ref_name_type character varying(2),
  name_display text,
  load_date timestamp without time zone
)

Indexes:

CREATE INDEX ref_name_ref_name_trigram_idx ON ref_name
  USING gist (ref_name COLLATE pg_catalog."default" gist_trgm_ops);

CREATE INDEX ref_name_ref_name_trigram_idx_1 ON ref_name
  USING gist (ref_name COLLATE pg_catalog."default" gist_trgm_ops)
  WHERE ref_name_type::text = 'E'::text;

CREATE INDEX ref_name_ref_name_e_idx ON ref_name
  USING btree (ref_name COLLATE pg_catalog."default")
  WHERE ref_name_type::text = 'E'::text;

Query:

select a.ref_name_id as name_id,a.ref_name AS name,
  a.name_display AS name_display,b.ref_name_id AS matched_name_id,
  b.ref_name AS matched_name,b.name_display AS matched_name_display
from ref_name a
JOIN ref_name b
 ON a.ref_name_id<>b.ref_name_id
 AND a.ref_name_id>b.ref_name_id
 AND a.ref_name % b.ref_name
WHERE 
 a.ref_name ~>=~ 'A' and a.ref_name ~<~'B'
 AND b.ref_name ~>=~ 'A' and b.ref_name ~<~'B'
 AND a.ref_name_type='E'
 AND b.ref_name_type='E'

Explain Plan:

"Nested Loop  (cost=0.00..8560728.16 rows=3598470 width=96)"
"  ->  Seq Scan on ref_name a  (cost=0.00..96556.12 rows=103901 width=48)"
"        Filter: (((ref_name)::text ~>=~ 'A'::text) AND ((ref_name)::text ~<~ 'B'::text) AND ((ref_name_type)::text = 'E'::text))"
"  ->  Index Scan using ref_name_ref_name_trigram_idx_1 on ref_name b  (cost=0.00..80.41 rows=35 width=48)"
"        Index Cond: ((a.ref_name)::text % (ref_name)::text)"
"        Filter: (((ref_name)::text ~>=~ 'A'::text) AND ((ref_name)::text ~<~ 'B'::text) AND (a.ref_name_id <> ref_name_id) AND (a.ref_name_id > ref_name_id))"

Here are some sample records:

1652632;"A 123 SYSTEMS";"E";"A 123 SYSTEMS INC";"2014-11-14 00:00:00"
1652633;"A123 SYSTEMS";"E";"A123 SYSTEMS INC";"2014-11-14 00:00:00"
1652640;"A 1 ACCOUSTICS";"E";"A-1 ACCOUSTICS";"2014-11-14 00:00:00"
1652641;"A 1 ACOUSTICS";"E";"A-1 ACOUSTICS";"2014-11-14 00:00:00"
1652642;"A1 ACOUSTICS";"E";"A1 ACOUSTICS INC";"2014-11-14 00:00:00"
1652650;"A 1 A ELECTRICAL";"E";"A-1 A ELECTRICAL INC";"2014-11-14 00:00:00"
1652651;"A 1 A ELECTRICIAN";"E";"A 1 A ELECTRICIAN INC";"2014-11-14 00:00:00"
1652652;"A 1A ELECTRICIAN";"E";"A 1A ELECTRICIAN INC";"2014-11-14 00:00:00"
1652653;"A1 A ELECTRICIAN";"E";"A1 A ELECTRICIAN INC";"2014-11-14 00:00:00"
1691270;"ALBERT GARLATTI";"E";"ALBERT GARLATTI";"2014-11-14 00:00:00"
1691271;"ALBERT GARLATTI CONSTRUCTION";"E";"ALBERT GARLATTI CONSTRUCTION CO";"2014-11-14 00:00:00"
1680892;"AG HOG PITTSBURGH";"E";"AG-HOG PITTSBURGH CO INC";"2014-11-14 00:00:00"
1680893;"AGHOG PITTSBURGH";"E";"AGHOG PITTSBURGH CO";"2014-11-14 00:00:00"
1680928;"AGILE PURSUITS FRACHISING";"E";"AGILE PURSUITS FRACHISING INC";"2014-11-14 00:00:00"
1680929;"AGILE PURSUITS FRANCHISING";"E";"AGILE PURSUITS FRANCHISING INC";"2014-11-14 00:00:00"
1680956;"AGING COMMUNITY COORDINATED ENTERPRISES & SUPPORT";"E";"AGING COMMUNITY COORDINATED ENTERPRISES & SUPPORT";"2014-11-14 00:00:00"
1680957;"AGING COMMUNITY COORDINATED ENTERPRISES & SUPPORTI";"E";"AGING COMMUNITY COORDINATED ENTERPRISES & SUPPORTI";"2014-11-14 00:00:00"

As you can see, I created a gist trigram index to speed things up (tried two different types so far for comparison). Does anyone have any suggestions on how I can improve the performance of this query and get it down from 11 hours to something more manageable? Eventually I would like to run this query on the whole table to compare records, not just this small subset.

解决方案

Indices

The partial GiST index is good, I would at least test these additional two indices:

A GIN index:

CREATE INDEX ref_name_trgm_gin_idx ON ref_name
USING gin (ref_name gin_trgm_ops)
WHERE ref_name_type = 'E';

This may or may not be used. If you upgrade to Postgres 9.4, chances are much better because there have been major improvements to GIN indexes.

A varchar_pattern_ops index:

CREATE INDEX ref_name_pattern_ops_idx
ON ref_name (ref_name varchar_pattern_ops)
WHERE ref_name_type = 'E';

Query

The problem at the heart of this query that you are running into a cross join with O(N²) when checking all rows against all rows. Performance becomes unbearable with a very big number of rows. You seem to be well aware of the dynamic. The defense is to limit possible combinations. You took a step in that direction already with limiting to the same first letter.

A very good option here is build on a special talent of GiST indices for nearest neighbour search. There is an hint in the manual for this query technique:

This can be implemented quite efficiently by GiST indexes, but not by GIN indexes. It will usually beat the first formulation when only a small number of the closest matches is wanted.

A GIN index may still get used in addition to the GiST index. You have to weigh cost and benefit. May be cheaper overall to stick with one big index in versions before 9.4. But it's probably worth it in pg 9.4.

Postgres 9.2

Use correlated subqueries to substitute for the not yet existing missing LATERAL join:

SELECT a.*
     , b.ref_name     AS match_name
     , b.name_display AS match_name_display
FROM  (
   SELECT ref_name_id
        , ref_name
        , name_display
        , (SELECT ref_name_id AS match_name_id
           FROM   ref_name b
           WHERE  ref_name_type = 'E'
           AND    ref_name ~~ 'A%'
           AND    ref_name_id > a.ref_name_id
           AND    ref_name % a.ref_name
           ORDER  BY ref_name <-> a.ref_name
           LIMIT  1                                -- max. 1 best match
          )
   FROM   ref_name a
   WHERE  ref_name ~~ 'A%'
   AND    ref_name_type = 'E'
   ) a
JOIN   ref_name b ON b.ref_name_id = a.match_name_id
ORDER  BY 1;

Obviously, this also needs an index on ref_name_id, which should normally be the PK and therefore indexed automatically.

I added two more variants in the SQL Fiddle.

Postgres 9.3+

Use a LATERAL join for matching set to set. Similar to chapter 2a in this related answer:

SELECT a.ref_name_id
     , a.ref_name
     , a.name_display
     , b.ref_name_id  AS match_name_id
     , b.ref_name     AS match_name
     , b.name_display AS match_name_display
FROM   ref_name a
,   LATERAL (
   SELECT b.ref_name_id, b.ref_name, b.name_display
   FROM   ref_name b
   WHERE  b.ref_name ~~ 'A%'
   AND    b.ref_name_type = 'E'
   AND    a.ref_name_id < b.ref_name_id
   AND    a.ref_name % b.ref_name  -- also enforce min. similarity
   ORDER  BY a.ref_name <-> b.ref_name
   LIMIT  10                                -- max. 10 best matches
   ) b
WHERE  a.ref_name ~~ 'A%'   -- you can extend the search
AND    a.ref_name_type = 'E'
ORDER  BY 1;

SQL Fiddle with all variants compared to your original query on 40k rows modeled after your case.

Queries are 2 - 5 x faster as your original in the fiddle. And I expect them to scale much better with millions of rows. You'll have to test.

Extending the search for matches in b to all rows (while limiting candidates in a to a reasonable number) is rather cheap, too. I added two other variants to the fiddle.

Aside: I ran all tests with text instead of varchar, but that shouldn't make a difference.

Basics and links:

这篇关于通过相似性Postgres模糊自连接查询提高性能的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆