单字符文本搜索替代 [英] Single character text search alternative

查看:96
本文介绍了单字符文本搜索替代的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

要求:确保以最有效,最高效的方式处理复合列上的单字符ci文本搜索,包括相关权重排序;

拥有表创建表test_search(id int primary键,full_name varchar(300)不为null,short_name varchar(30)不为null); 具有3百万行的行建议程序api调用从第一个输入字符开始将查询发送到db,按相关性应将前20个结果排序被退回。

Requirement: ensure single character ci text search over compound columns is processed in most efficient and performant way including relevance weight sorting;
Having a table create table test_search (id int primary key, full_name varchar(300) not null, short_name varchar(30) not null); with 3 mln rows suggester api call sends queries to db starting from first input character and first 20 results ordered by relevance should be returned.

选项/缺点:


  • / ilike 超过'%c%'在大型数据集上速度较慢,没有关联

  • pg_trgm,具有基于三字母组的搜索 like / ilike +复合 gin / gist 索引:单个字符不能拆分成几个字母,因此搜索通过表fullscan完成,没有关联

  • 通过 setweight(to_tsvector(lower()))进行全文搜索 gin / gist 索引:基于相关性的输出,但由于标记排除单个字符而导致的结果较少

  • like lower() / ilike over '%c%': slow on big dataset, no relevance;
  • pg_trgm with trigram based search like/ilike + compound gin/gist index: single character cannot be splitted into several trigrams so search is done via table fullscan, no relevance;
  • fulltext search via setweight(to_tsvector(lower())) gin/gist index: relevance based output but less results because of tokens exclude single characters;

是否有其他选项可以改善单字符搜索?如何改进或混合上述以获得最佳结果?如何强迫全文跳过非索引字表并创建所有可能的词素,例如 it是否可以用于sqlserver

Are there other options available to improve single character search? How to improve or mix mentioned above to get the best result? How to force fulltext to skip stoplist and create all possible lexemes like it is possible for sqlserver?

推荐答案

全文搜索完全不能帮助您,因为仅整个单词都被索引了,您无法搜索子字符串。

Full-text search won't help you at all with this, because only whole words are indexed, and you cannot search for substrings.

您可能可以做的最好的事情就是使用此功能:

The best you can probably do is use this function:

CREATE FUNCTION get_chars(text) RETURNS char(1)[]
   LANGUAGE sql IMMUTABLE AS
$$SELECT array_agg(DISTINCT x)::char(1)[] FROM regexp_split_to_table($1, '') AS x$$;

然后编制索引

CREATE INDEX ON test_search USING gin (get_chars(full_name || short_name));

并像

SELECT * FROM test_search
WHERE get_chars(full_name || short_name) @> ARRAY['c']::char(1)[];

对于频繁出现的字符,此查询仍应使用顺序扫描,因为这是最佳的访问方法。但是对于稀有字符,您可能会更快。

For frequent characters, this query should still use a sequential scan, since that is the best access method. But for rare characters you may be faster that way.

这篇关于单字符文本搜索替代的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆