PostgreSQL全文搜索和Trigram混淆 [英] PostgreSQL Full Text Search and Trigram Confusion

查看:60
本文介绍了PostgreSQL全文搜索和Trigram混淆的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我对PostgreSQL的整个概念,全文搜索和Trigram有点困惑.在我的全文搜索查询中,我正在使用tsvectors,就像这样:

I'm a little bit confused with the whole concept of PostgreSQL, full text search and Trigram. In my full text search queries, I'm using tsvectors, like so:

SELECT * FROM articles
WHERE search_vector @@ plainto_tsquery('english', 'cat, bat, rat');

问题是,此方法不能解决拼写错误.然后,我开始阅读有关 Trigram和pg_trgm :

The problem is, this method doesn't account for misspelling. Then I started to read about Trigram and pg_trgm:

通过其他示例,似乎使用了trigram或使用了vector,但从未同时使用这两种方法.所以我的问题是:它们曾经一起使用过吗?如果是这样,怎么办? Trigram会取代全文吗?语法更准确吗?语法上的表现如何?

Looking through other examples, it seems like trigram is used or vectors are used, but never both. So my questions are: Are they ever used together? If so, how? Does trigram replace full text? Are trigrams more accurate? And how are trigrams on performance?

推荐答案

它们的用途截然不同.

They serve very different purposes.

  • 全文搜索用于返回与词干搜索查询匹配的文档.
  • Trigrams为您提供了一种比较两个字符串并确定它们看起来如何相似的方法.

请考虑以下示例:

SELECT 'cat' % 'cats'; --true

上面的返回true,因为'cat''cats'非常相似(由pg_trgm限制决定).

The above returns true because 'cat' is quite similar to 'cats' (as dictated by the pg_trgm limit).

SELECT 'there is a cat with a dog' % 'cats'; --false

上面的返回false是因为%在两个完整的字符串之间进行类似的查找,而不是在字符串中的cats 中查找单词.

The above returns false because % is looking for similarily between the two entire strings, not looking for the word cats within the string.

SELECT to_tsvector('there is a cat with a dog') @@ to_tsquery('cats'); --true

这将返回true,因为tsvector将字符串转换为词干列表,并忽略了一堆常用词(停用词-如"is"和"a")...然后搜索了词干版本cats.

This returns true becauase tsvector transformed the string into a list of stemmed words and ignored a bunch of common words (stop words - like 'is' & 'a')... then searched for the stemmed version of cats.

听起来您想使用三字组来自动更正您的ts_query,但这实际上是不可能的(无论如何都不能以任何有效的方式).他们并不会真的知道一个单词是拼写错误的,只是它与另一个单词有多么相似.它们可以用于搜索单词表以尝试查找相似的单词,从而允许您实现您的意思是……"类型的功能,但是此单词需要维护一个单独的表,其中包含所有search字段中使用的字词.

It sounds like you want to use trigrams to auto-correct your ts_query but that is not really possible (not in any efficient way anyway). They do not really know a word is misspelt, just how similar it might be to another word. They could be used to search a table of words to try and find similar words, allowing you to implement a "did you mean..." type feature, but this word require maintaining a separate table containing all the words used in your search field.

如果您希望文本索引匹配一些常见的拼写错误的词/短语,则可能需要查看

If you have some commonly misspelt words/phrases that you want the text-index to match you might want to look at Synonym Dictorionaries

这篇关于PostgreSQL全文搜索和Trigram混淆的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆