如何为 pg_trgm `'term' % ANY (array_column)` 查询索引字符串数组列? [英] How to index a string array column for pg_trgm `'term' % ANY (array_column)` query?

查看:66
本文介绍了如何为 pg_trgm `'term' % ANY (array_column)` 查询索引字符串数组列?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我尝试了普通的 Postgres gin 索引以及 pg_trgm gin_trgm_opsgist_trgm_ops 索引(使用此解决方法:https://stackoverflow.com/a/33016333/283398).

I have tried an ordinary Postgres gin index as well as the pg_trgm gin_trgm_ops and gist_trgm_ops indexes (using this workaround: https://stackoverflow.com/a/33016333/283398).

但是,在我的查询 'term' % ANY (array_column) 中的 EXPLAIN 即使在执行了 set enable_seqscan = off; 后仍显示顺序扫描.

However EXPLAIN on my query 'term' % ANY (array_column) shows a sequential scan even after having executed set enable_seqscan = off;.

(对于我的用例,我需要部分匹配,pg_trgm 似乎比全文搜索更适合,因为我的数据不是语言的.我的 pg_trgm 结果的质量非常好.)

(For my use case, I need partial matches and pg_trgm seems like a much better fit than full-text search because my data is not linguistic. The quality of my pg_trgm results is very good.)

我的用例是包含一个数组列的行,其中包含名字和全名(空格分隔)的混合.搜索词可以是名字、姓氏或全名(以空格分隔).pg_trgm % 运算符结果不区分大小写,并且似乎对数组列中名称的开头和结尾的高度匹配进行加权,这对于全名非常有用,因为它可以找到匹配的名字和姓氏,但不一定是中间名.

My use case is rows with an array column containing a mix of first names and full names (space-delimited). The search term may be a first, last, or full name (space-delimited). The pg_trgm % operator results are case insensitive and appear to weight highly matches at the beginning and end of the names in the array column, which is great for full names because it finds matching first and last names but not necessarily middle names.

https://github.com/theirix/parray_gin 很有前途,但很旧,而且没有不声称支持 9.2 以上的 Postgres.

https://github.com/theirix/parray_gin is promising, but old, and doesn't claim to support Postgres newer than 9.2.

推荐答案

为什么这不起作用

索引类型(即操作符类)gin_trgm_ops 基于 % 操作符,它作用于两个 text 参数:

Why this does not work

The index type (i.e. operator class) gin_trgm_ops is based on % operator, which works on two text arguments:

CREATE OPERATOR trgm.%(
  PROCEDURE = trgm.similarity_op,
  LEFTARG = text,
  RIGHTARG = text,
  COMMUTATOR = %,
  RESTRICT = contsel,
  JOIN = contjoinsel);

您不能将 gin_trgm_ops 用于数组.为数组列定义的索引永远不会与 any(array[...]) 一起使用,因为数组的单个元素没有被索引.索引数组需要不同类型的索引,即 gin 数组索引.

You cannot use gin_trgm_ops for arrays. An index defined for an array column will never work with any(array[...]) because individual elements of arrays are not indexed. Indexing an array would require a different type of index, namely gin array index.

幸运的是,索引 gin_trgm_ops 的设计非常巧妙,它可以与运算符 likeilike 一起使用,它们可以用作替代解决方案(示例如下所述).

Fortunately, the index gin_trgm_ops has been so cleverly designed that it is working with operators like and ilike, which can be used as an alternative solution (example described below).

有两列 (id serial primary key, names text[]) 并包含 100000 条拉丁语句子,分成数组元素.

has two columns (id serial primary key, names text[]) and contains 100000 latin sentences split into array elements.

select count(*), sum(cardinality(names))::int words from test;

 count  |  words  
--------+---------
 100000 | 1799389

select * from test limit 1;

 id |                                                     names                                                     
----+---------------------------------------------------------------------------------------------------------------
  1 | {fugiat,odio,aut,quis,dolorem,exercitationem,fugiat,voluptates,facere,error,debitis,ut,nam,et,voluptatem,eum}

搜索词片段 praesent 在 2400 毫秒内得到 7051 行:

Searching for the word fragment praesent gives 7051 rows in 2400 ms:

explain analyse
select count(*)
from test
where 'praesent' % any(names);

                                                  QUERY PLAN                                                   
---------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=5479.49..5479.50 rows=1 width=0) (actual time=2400.866..2400.866 rows=1 loops=1)
   ->  Seq Scan on test  (cost=0.00..5477.00 rows=996 width=0) (actual time=1.464..2400.271 rows=7051 loops=1)
         Filter: ('praesent'::text % ANY (names))
         Rows Removed by Filter: 92949
 Planning time: 1.038 ms
 Execution time: 2400.916 ms

物化视图

一种解决方案是规范化模型,包括在一行中创建一个具有单个名称的新表.由于现有的查询、视图、功能或其他依赖关系,这种重组可能难以实施,有时甚至不可能实施.不改变表结构,使用物化视图也可以达到类似的效果.

Materialized view

One solution is to normalize the model, involving the creation of a new table with a single name in one row. Such restructuring may be difficult to implement and sometimes impossible due to existing queries, views, functions, or other dependencies. A similar effect can be achieved without changing table structure, using a materialized view.

create materialized view test_names as
    select id, name, name_id
    from test
    cross join unnest(names) with ordinality u(name, name_id)
    with data;

With ordinal 不是必需的,但在以与主表中相同的顺序聚合名称时很有用.查询test_names 得到与主表相同的结果.

With ordinality is not necessary, but can be useful when aggregating the names in the same order as in the main table. Querying test_names gives the same results as the main table in the same time.

创建索引后执行时间反复减少:

After creating the index execution time decreases repeatedly:

create index on test_names using gin (name gin_trgm_ops);

explain analyse
select count(distinct id)
from test_names
where 'praesent' % name

                                                                QUERY PLAN                                                                 
-------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=4888.89..4888.90 rows=1 width=4) (actual time=56.045..56.045 rows=1 loops=1)
   ->  Bitmap Heap Scan on test_names  (cost=141.95..4884.39 rows=1799 width=4) (actual time=10.513..54.987 rows=7230 loops=1)
         Recheck Cond: ('praesent'::text % name)
         Rows Removed by Index Recheck: 7219
         Heap Blocks: exact=8122
         ->  Bitmap Index Scan on test_names_name_idx  (cost=0.00..141.50 rows=1799 width=0) (actual time=9.512..9.512 rows=14449 loops=1)
               Index Cond: ('praesent'::text % name)
 Planning time: 2.990 ms
 Execution time: 56.521 ms

该解决方案有一些缺点.因为视图是物化的,所以数据在数据库中存储了两次.您必须记住在更改主表后刷新视图.并且查询可能更复杂,因为需要将视图加入到主表中.

The solution has a few drawbacks. Because the view is materialized, the data is stored twice in the database. You have to remember to refresh the view after changes to the main table. And queries may be more complicated because of the need to join the view to the main table.

我们可以在表示为文本的数组上使用 ilike.我们需要一个不可变的函数来为整个数组创建索引:

We can use ilike on the arrays represented as text. We need an immutable function to create the index on the array as a whole:

create function text(text[])
returns text language sql immutable as
$$ select $1::text $$

create index on test using gin (text(names) gin_trgm_ops);

并在查询中使用该函数:

and use the function in queries:

explain analyse
select count(*)
from test
where text(names) ilike '%praesent%' 

                                                           QUERY PLAN                                                            
---------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=117.06..117.07 rows=1 width=0) (actual time=60.585..60.585 rows=1 loops=1)
   ->  Bitmap Heap Scan on test  (cost=76.08..117.03 rows=10 width=0) (actual time=2.560..60.161 rows=7051 loops=1)
         Recheck Cond: (text(names) ~~* '%praesent%'::text)
         Heap Blocks: exact=2899
         ->  Bitmap Index Scan on test_text_idx  (cost=0.00..76.08 rows=10 width=0) (actual time=2.160..2.160 rows=7051 loops=1)
               Index Cond: (text(names) ~~* '%praesent%'::text)
 Planning time: 3.301 ms
 Execution time: 60.876 ms

60 与 2400 毫秒,相当不错的结果,无需创建额外的关系.

60 versus 2400 ms, quite nice result without the need of creating additional relations.

这个解决方案看起来更简单,需要的工作更少,但前提是 ilike 比 trgm % 运算符更不精确的工具就足够了.

This solution seems simpler and requiring less work, provided however that ilike, which is less precise tool than the trgm % operator, is sufficient.

为什么我们应该使用 ilike 而不是 % 将整个数组作为文本?相似度很大程度上取决于文本的长度.在各种长度的长文本中搜索一个词很难选择一个合适的限制.例如.limit = 0.3 我们有结果:

Why should we use ilike rather than % for whole arrays as text? The similarity depends largely on the length of the texts. It is very difficult to choose an appropriate limit for the search a word in long texts of various length. E.g. with limit = 0.3 we have the results:

with data(txt) as (
values
    ('praesentium,distinctio,modi,nulla,commodi,tempore'),
    ('praesentium,distinctio,modi,nulla,commodi'),
    ('praesentium,distinctio,modi,nulla'),
    ('praesentium,distinctio,modi'),
    ('praesentium,distinctio'),
    ('praesentium')
)
select length(txt), similarity('praesent', txt), 'praesent' % txt "matched?"
from data;

 length | similarity | matched? 
--------+------------+----------
     49 |   0.166667 | f           <--!
     41 |        0.2 | f           <--!
     33 |   0.228571 | f           <--!
     27 |   0.275862 | f           <--!
     22 |   0.333333 | t
     11 |   0.615385 | t
(6 rows)

这篇关于如何为 pg_trgm `'term' % ANY (array_column)` 查询索引字符串数组列?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆