How to index a Postgres table by name, when the name can be in any language?

Question

I have a large Postgres table of locations (shops, landmarks, etc.) which the user can search in various ways. When the user wants to search for the name of a place, the system currently does the following (assuming the search is for "cafe"):

lower(location_name) LIKE '%cafe%'

as part of the query. This is hugely inefficient. Prohibitively so. It is essential that I make this faster. I've tried indexing the table on

gin(to_tsvector('simple', location_name))

and searching for

(to_tsvector('simple',location_name) @@ to_tsquery('simple','cafe'))

which works beautifully, and cuts down the search time by a couple of orders of magnitude.

However, the location names can be in any language, including languages like Chinese, which aren't whitespace-delimited. This new system is unable to find any Chinese locations unless I search for the exact name, whereas the old system could find matches to partial names just fine.
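To see why partial Chinese names stop matching, it can help to look at the tokens directly. The following is only a sketch with made-up sample names, assuming a UTF-8 database; the exact output depends on the database encoding and locale, so none is shown here.

```sql
-- Space-delimited name: the default parser can split it into one lexeme per word.
SELECT to_tsvector('simple', 'Blue Bottle Cafe');

-- Name without spaces: the default parser has no word boundaries to split on,
-- so a query for just part of the name has no matching lexeme to hit.
SELECT to_tsvector('simple', '蓝瓶咖啡馆');
```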

So, my question is: Can I get this to work for all languages at once, or am I on the wrong track?

Recommended answer

If you want to optimize arbitrary substring matches, one option is to use the pg_trgm module. Add an index:

CREATE INDEX table_location_name_trigrams_key ON table
  USING gin (location_name gin_trgm_ops);
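The index definition relies on the pg_trgm extension being installed in the database. A minimal setup sketch follows, assuming a hypothetical table named locations (the statement above uses the placeholder table, which is a reserved word in SQL and would need quoting or renaming); the index name is likewise made up for illustration.

```sql
-- pg_trgm ships with the standard contrib modules; it has to be enabled
-- once per database by a role that is allowed to create extensions.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Hypothetical table, for illustration only.
CREATE TABLE IF NOT EXISTS locations (
    id            bigserial PRIMARY KEY,
    location_name text NOT NULL
);

-- GIN index over the trigrams of location_name.
CREATE INDEX IF NOT EXISTS locations_location_name_trgm_idx
    ON locations USING gin (location_name gin_trgm_ops);
```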

This will break "Simple Cafe" into "sim", "imp", "mpl", etc., and add an entry to the index for each trigram in each row. The query planner can then automatically use this index for substring pattern matches, including:
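pg_trgm provides a show_trgm function that makes this decomposition visible. A quick way to inspect it, assuming the extension is installed:

```sql
-- Returns the trigrams pg_trgm extracts from the string, as a text[].
-- The input is lowercased and each word is padded with spaces, so the
-- result also contains boundary trigrams such as '  s' and 'fe '.
SELECT show_trgm('Simple Cafe');
```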

SELECT * FROM table WHERE location_name ILIKE '%cafe%';
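Whether the planner actually picks the trigram index depends on table size and statistics, so it is worth checking the plan. A sketch, assuming a hypothetical locations table that has a gin_trgm_ops index on location_name:

```sql
-- ANALYZE runs the query for real and reports actual timings;
-- BUFFERS adds page-access counts to the output.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM locations WHERE location_name ILIKE '%cafe%';
```

When the index is used, the plan typically shows a Bitmap Index Scan on the trigram index feeding a Bitmap Heap Scan with a recheck condition. Note that an index on location_name serves ILIKE on that column; the lower(location_name) LIKE form from the question would need the index to be built on lower(location_name) instead.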

This query will look up "caf" and "afe" in the index, find the intersection, fetch those rows, then check each row against your pattern. (That last check is necessary since the intersection of "caf" and "afe" matches both "simple cafe" and "unsafe scaffolding", while "%cafe%" should only match one.) The index becomes more effective as the input pattern gets longer since it can exclude more rows, but it's still not as efficient as indexing whole words, so don't expect a performance improvement over to_tsvector.
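The false positive described above can be reproduced without any table. This sketch roughly mimics the index filter by testing whether a string's trigram set contains both "caf" and "afe", and compares that with the real pattern match; it assumes pg_trgm is installed.

```sql
SELECT s AS name,
       -- Stand-in for the index lookup: both trigrams must be present.
       show_trgm(s) @> ARRAY['caf', 'afe'] AS passes_trigram_filter,
       -- The recheck step: the actual pattern match.
       s ILIKE '%cafe%'                    AS matches_pattern
FROM (VALUES ('simple cafe'), ('unsafe scaffolding')) AS t(s);
```

Both rows pass the trigram filter ("unsafe" contributes "afe" and "scaffolding" contributes "caf"), but only "simple cafe" matches the pattern.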

The catch is that trigrams don't work at all for patterns that are under three characters. That may or may not be a deal-breaker for your application.

Edit: I initially added this as a comment.

I had another thought last night when I was mostly asleep. Make a cjk_chars function that takes an input string, runs regexp_matches over the entire CJK Unicode ranges, and returns an array of any such characters, or NULL if there are none. Add a GIN index on cjk_chars(location_name). Then query for:

WHERE CASE
  WHEN cjk_chars('query') IS NOT NULL THEN
    cjk_chars(location_name) @> cjk_chars('query')
    AND location_name LIKE '%query%'
  ELSE
    <tsvector/trigrams>
  END

Ta-da, unigrams!
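The answer names cjk_chars but does not define it, so the following is only one possible sketch. It assumes a UTF-8 database, a hypothetical locations table, and covers just the CJK Unified Ideographs block (U+4E00 to U+9FFF); kana, hangul, and the extension blocks would need additional ranges. The index name and the sample query string are made up for illustration.

```sql
-- Hypothetical helper: the distinct CJK characters in the input,
-- or NULL when there are none (array_agg over zero rows yields NULL).
-- It must be IMMUTABLE to be usable in an index expression.
CREATE OR REPLACE FUNCTION cjk_chars(input text)
RETURNS text[]
LANGUAGE sql
IMMUTABLE
AS $$
    SELECT array_agg(DISTINCT m[1])
    FROM regexp_matches(input, '[\u4e00-\u9fff]', 'g') AS m;
$$;

-- GIN index over the per-row character arrays.
CREATE INDEX IF NOT EXISTS locations_cjk_chars_idx
    ON locations USING gin (cjk_chars(location_name));

-- Example search: the array containment test can use the index, and the
-- LIKE condition then confirms the characters appear as a contiguous run.
SELECT *
FROM locations
WHERE cjk_chars(location_name) @> cjk_chars('咖啡')
  AND location_name LIKE '%咖啡%';
```

The LIKE recheck plays the same role as in the trigram case: containment only proves that each character occurs somewhere in the name, not that they occur together or in order.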
