Benefits and trade-offs for improving text search on small data in PostgreSQL


Question


  • I have 4 text columns of interest.
  • Each column is up to about 100 characters.
  • The text in 3 of the columns is mostly Latin words. (The data is a biological catalog, and these are names of things.)
  • The data is currently about 500 rows. I don't expect this to grow beyond 1000.
  • A small number of users (fewer than 10) will have editing privileges to add, update, and delete data. I do not expect these users to put a heavy load on the database.

So all this suggests a pretty small data set to consider.

I need to perform a search on all 4 columns for rows where at least 1 column contains the search text (case insensitive). The query will be issued (and the results served) via a web application. I'm a bit lost about how to approach it.

PostgreSQL offers a few options for improving text search speed. The built-in options I've been considering are:


  1. Don't try to index this at all. Just use ILIKE, LIKE on lower(), or similar. (Without an index?)
  2. Index with pg_trgm to improve search speed. I assume I would need to index the concatenation of the columns somehow.
  3. Full text searching. I assume this would also involve concatenating the columns for the index.

Unfortunately, I'm not really familiar with the expected performance of any of these, or with their benefits and trade-offs, so it's hard to know what I should try first and what I shouldn't even consider. Some things I have read suggest that the indexing for 2 and 3 is pretty slow, which conflicts with the fact that I'll have occasional modifications going on. And the mixed language makes full text search seem unattractive, since it appears to be language based, unless it can handle multiple languages simultaneously. Would I expect that for data this small, a simple ILIKE, or maybe a LIKE on lower(), is probably fast enough? Or maybe the indexing is fast enough for the low load of modifications on data this small? Would I be better off looking for something outside the database?

Granted, I would have to benchmark all of these to know for sure which is fastest, but unfortunately I don't have much time for this project. So what are the benefits and trade-offs of these methods? Which of these options are not appropriate for this type of problem? What other types of solutions (including, potentially, ones outside the database) are worth considering?

(I suppose I might find some kind of beginner's tutorial on text searching in PG useful, but my searches mostly turn up Full Text Search, and I don't even know whether that is useful for me.)

I'm on PG 9.2.4, so any goodies pre-9.3 are an option.

Recommended answer

Update: I've expanded this answer into a detailed blog post.

Rather than focusing purely on speed, please consider search semantics first. Define your requirements.

For example, do users need to be able to differentiate based on the order of terms? Should a search for

radiata pinus

find

pinus radiata?

Are spaces always word separators, or are spaces within a column part of the search term?
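The two questions above (does term order matter, and are spaces separators) can be made concrete with a small model. The following is a minimal pure-Python sketch, not PostgreSQL code: the row layout, the column names, and both helper functions are hypothetical and exist only to contrast substring semantics (roughly what ILIKE '%text%' over each column gives) with word-based semantics (roughly what a full-text style search gives).

```python
# Pure-Python model of two possible search semantics; not PostgreSQL code.
# The row layout and column names below are hypothetical.

def contains_phrase(row, text):
    """True if any column contains `text` as a case-insensitive substring.

    Word order and spacing in `text` matter, as with ILIKE '%text%'.
    """
    needle = text.lower()
    return any(needle in value.lower() for value in row.values())


def contains_all_words(row, text):
    """True if every word of `text` occurs as a whole word somewhere in the row.

    Spaces are treated as word separators and word order does not matter.
    """
    words_in_row = set()
    for value in row.values():
        words_in_row.update(value.lower().split())
    return all(word in words_in_row for word in text.lower().split())


row = {
    "genus": "Pinus",
    "species": "radiata",
    "common_name": "Monterey pine",
    "notes": "Pinus radiata D. Don",
}

print(contains_phrase(row, "pinus radiata"))     # True
print(contains_phrase(row, "radiata pinus"))     # False
print(contains_all_words(row, "radiata pinus"))  # True
```

If the second result should have been a hit for your users, substring matching alone does not meet the requirement, whatever its speed.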

Do you need wildcards? If so, do you need only left-anchored patterns (prefix searches, think staph%), or do you need right-anchored or infix wildcards too (%ccus, p%s)? Only pg_trgm will help you with infix wildcards. Suffix wildcards can be handled by an index on the reverse() of a word, but that gets clumsy quickly, so in practice pg_trgm is the best option there.
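To see why trigrams can serve infix patterns, here is a simplified pure-Python model of the idea behind pg_trgm. It is a sketch under assumptions, not the extension's actual code: the function names are made up, word splitting here is ASCII-only, and the real extension differs in details. The padding rule (two leading spaces, one trailing space per word) follows what pg_trgm's show_trgm() reports for a simple word.

```python
import re


def trigrams(text):
    """Simplified model of pg_trgm extraction: lower-case each word,
    pad it with two leading spaces and one trailing space, then take
    every 3-character window."""
    result = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        result.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return result


def fragment_trigrams(fragment):
    """Trigrams of a literal fragment taken from inside a pattern such as
    '%ccus%'. No padding is added, because the fragment is not known to
    start or end a word. Fragments under 3 characters yield no trigrams."""
    fragment = fragment.lower()
    return {fragment[i:i + 3] for i in range(len(fragment) - 2)}


def may_match(value, fragment):
    """Index-style pre-filter: a value can contain the fragment only if it
    has all of the fragment's trigrams. Surviving candidates would still
    need to be rechecked against the real pattern."""
    return fragment_trigrams(fragment) <= trigrams(value)


print(sorted(trigrams("cat")))                     # ['  c', ' ca', 'at ', 'cat']
print(may_match("Staphylococcus aureus", "ccus"))  # True
print(may_match("Streptomyces", "ccus"))           # False
```

Because the filter works on any 3-character window, it does not care where in the word the fragment sits, which is what a b-tree index cannot offer. Very short fragments produce few or no trigrams, so they filter poorly.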

If you're mostly searching for discrete words and word order isn't important, Pg's full-text search with to_tsvector and to_tsquery will be desirable. It supports left-anchored wildcard searches, weighting, categories, etc.

If you're mostly doing prefix searches of discrete columns, then simple LIKE queries on a regular b-tree index per column will be the way to go.
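The reason an ordered index handles prefix patterns well, and why reversing the key turns a suffix pattern into a prefix one, can be sketched in pure Python. This is only a model: a sorted list plus binary search stands in for a b-tree, and the sample names and the prefix_matches helper are hypothetical.

```python
from bisect import bisect_left


def prefix_matches(sorted_keys, prefix):
    """Return all keys starting with `prefix`, using one binary search
    plus a short forward walk, much as an ordered index could."""
    start = bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so no later key can match
        matches.append(key)
    return matches


names = [
    "Pinus radiata",
    "Staphylococcus aureus",
    "Streptococcus pyogenes",
    "Staphylinus olens",
]

# Stand-in for an index on lower(name): serves prefix patterns like staph%
forward_index = sorted(name.lower() for name in names)
# Stand-in for an index on reverse(lower(name)): serves suffix patterns like %aureus
reverse_index = sorted(name.lower()[::-1] for name in names)

print(prefix_matches(forward_index, "staph"))
# ['staphylinus olens', 'staphylococcus aureus']

suffix = "aureus"
print([key[::-1] for key in prefix_matches(reverse_index, suffix[::-1])])
# ['staphylococcus aureus']
```

Note that each extra search shape needs its own index and its own query form, which is the clumsiness mentioned earlier. In PostgreSQL itself, a b-tree index is only used for LIKE prefix patterns if the database uses the C locale or the index is created with a pattern operator class such as text_pattern_ops.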

So: figure out what you need, then how to do it. Your current uncertainty probably stems partly from not really knowing quite what you want.
