索引数组以进行全文搜索 [英] Indexing an array for full text search

查看:92
本文介绍了索引数组以进行全文搜索的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试索引可在其标签数组上搜索的文档。

 创建索引doc_search_idx在文档
上使用gin(
to_tsvector('english',array_to_string(tags,''))||
to_tsvector('english',coalesce(notes,')))

其中标记( ci)text [] 。但是,PG将拒绝索引 array_to_string ,因为它并非始终不变

  PG :: InvalidObjectDefinition :错误:索引表达式中的函数必须标记为IMMUTABLE 

我尝试创建自制程序 array_to_string 不可变的函数,但是我不知道自己在做什么,所以喜欢玩火。有没有办法不重新实现它?



看起来我可以重新包装相同的函数并将其标记为不可变,但是看起来像这样做有风险



如何为全文搜索索引数组?

解决方案

在我的最初答案中,我建议将文本强制转换为文本: tags :: text 。但是,尽管大多数从基本类型到文本的强制转换都是在 IMMUTABLE 中定义的,但对于数组类型却并非如此。显然是因为(在pgsql-general的帖子中引用Tom Lane a>):


因为它是通过array_out / array_in而不是其他
直接方法实现的,所以它们是标记为稳定,因为它们可能
调用不可更改的元素I / O函数


加粗强调。



我们可以解决这个问题。一般情况下不能标记为 IMMUTABLE 。但是对于当前情况(将 citext [] text [] 投射到 text ),我们可以放心地假设不变性。创建一个简单的 IMMUTABLE SQL函数来包装该函数。但是,我的简单解决方案的吸引力现在几乎消失了。您也可以将 array_to_string()包装起来(就像您已经考虑过的那样),并考虑类似的考虑。



对于 citext [] (如果需要,可为 text [] 创建单独的函数):



两者之一(基于对文本的普通转换):

 创建或替换功能f_ciarr2text(citext [])
返回文本语言SQL IMMUTABLE AS'SELECT $ 1 :: text';

这更快。

或(使用 array_to_string ()表示不带花括号的结果):

 创建或替换功能f_ciarr2text(citext [] )
返回文本语言SQL IMMUTABLE AS $$ SELECT array_to_string($ 1,',')$$;

这有点正确。

然后:

 创建索引doc_search_idx使用gin打开文档doc(
to_tsvector('english',COALESCE(f_ciarr2text(tags),'')
||''|| COALESCE(notes,''))));

我没有使用多态类型 ANYARRAY 就像在您的答案中,因为我知道文本[] citext [] 是安全的,但是我不能保证 all 其他数组类型。



在Postgres 9.4中进行了测试,对我有用。



我在两个字符串之间添加了一个空格,以避免连接字符串之间出现假正匹配。手册中有一个示例



如果您有时只想搜索标记或仅搜索笔记,请考虑使用多列索引:

 创建索引doc_search_idx使用gin的ON文档(
to_tsvector('english ',COALESCE(f_ciarr2text(tags),'')
,to_tsvector('english',COALESCE(notes,''));

您所指的

风险主要适用于时间函数,这些函数在引用的问题。如果涉及时区(或只是类型 timestamptz ),结果实际上不是不变的。在这里,我们不会说谎,因为我们的功能实际上是 IMMUTABLE

相关



通常人们认为他们需要< href = http://www.postgresql.org/docs/current/interactive/textsearch.html rel = noreferrer>文本搜索,而具有三字母索引的相似性搜索将更合适:





在这种情况下不相关,但是在使用 citext ,请考虑以下问题:




I am trying to index documents to be searchable on their tag array.

CREATE INDEX doc_search_idx ON documents
      USING gin( 
    to_tsvector('english', array_to_string(tags, ' ')) ||
    to_tsvector('english', coalesce(notes, '')))
)

Where tags is a (ci)text[]. However, PG will refuse to index array_to_string because it is not always immutable.

PG::InvalidObjectDefinition: ERROR:  functions in index expression must be marked IMMUTABLE

I've tried creating a homebrew array_to_string immutable function, but I feel like playing with fire as I don't know what I'm doing. Any way not to re-implement it?

Looks like I could just repackage the same function and label it immutable, but looks like there are risks when doing that.

How do I index the array for full-text search?

解决方案

In my initial answer I suggested a plain cast to text: tags::text. However, while most casts to text from basic types are defined IMMUTABLE, this it is not the case for array types. Obviously because (quoting Tom Lane in a post to pgsql-general):

Because it's implemented via array_out/array_in rather than any more direct method, and those are marked stable because they potentially invoke non-immutable element I/O functions.

Bold emphasis mine.

We can work with that. The general case cannot be marked as IMMUTABLE. But for the case at hand (cast citext[] or text[] to text) we can safely assume immutability. Create a simple IMMUTABLE SQL function that wraps the function. However, the appeal of my simple solution is mostly gone now. You might as well wrap array_to_string() (like you already pondered) for which similar considerations apply.

For citext[] (create separate functions for text[] if needed):

Either (based on a plain cast to text):

CREATE OR REPLACE FUNCTION f_ciarr2text(citext[]) 
  RETURNS text LANGUAGE sql IMMUTABLE AS 'SELECT $1::text';

This is faster.
Or (using array_to_string() for a result without curly braces):

CREATE OR REPLACE FUNCTION f_ciarr2text(citext[]) 
  RETURNS text LANGUAGE sql IMMUTABLE AS $$SELECT array_to_string($1, ',')$$;

This is a tiny bit more correct.
Then:

CREATE INDEX doc_search_idx ON documents USING gin (
   to_tsvector('english', COALESCE(f_ciarr2text(tags), '')
                || ' ' || COALESCE(notes,'')));

I did not use the polymorphic type ANYARRAY like in your answer, because I know text[] or citext[] are safe, but I can't vouch for all other array types.

Tested in Postgres 9.4 and works for me.

I added a space between the two strings to avoid false positive matches across the concatenated strings. There is an example in the manual.

If you sometimes want to search just tags or just notes, consider a multicolumn index instead:

CREATE INDEX doc_search_idx ON documents USING gin (
             to_tsvector('english', COALESCE(f_ciarr2text(tags), '')
          ,  to_tsvector('english', COALESCE(notes,''));

The risks you are referring to apply to temporal functions mostly, which are used in the referenced question. If time zones (or just the type timestamptz) are involved, results are not actually immutable. We do not lie about immutability here. Our functions are actually IMMUTABLE. Postgres just can't tell from the general implementation it uses.

Related

Often people think they need text search, while similarity search with trigram indexes would be a better fit:

Not relevant in this exact case, but while working with citext, consider this:

这篇关于索引数组以进行全文搜索的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆