搜索表达索引 [英] Searching on expression indexes

查看:89
本文介绍了搜索表达索引的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

搜索表达式索引

我正在构建一个对源字段进行子字符串化的表达式索引,以避免溢出B树上的2172个字符限制:

I'm building an expression index that substrings a source field to avoid overflowing the 2172 character limit on B-trees:

CREATE INDEX record_changes_log_detail_old_value_ix_btree
    ON record_changes_log_detail 
    USING btree ((substring(old_value,1,1024)::text) text_pattern_ops);

记录如下:

-在RDS上为Postgres 11.4,在家里的macOS上为Postgres 11.4。

-在我的测试设置中,record_changes_log_detail表的大小约为800万。 >
-字段中的值的长度范围从1个字符到5,000个以上。大部分都很短。

-- Postgres 11.4 on RDS, 11.5 on macOS at home.
-- The record_changes_log_detail table has about 8M in my test setup.
-- The old_value field is of type citext.
-- Values in the field range in length from 1 character to over 5,000. Most are short.

此搜索使用上面指定的索引:

This search uses the index specified above:

select * from record_changes_log_detail 
where substring(old_value,1,1024) = 'Gold Kerrison Neuro';

此搜索使用索引:

where old_value = 'Gold Kerrison Neuro';

我发现这很令人惊讶,而且真是令人讨厌。如果我正确理解了jjanes关于另一个问题的评论,则计划程序仅会认识到当您的查询语句使用完全相同的表达式时,索引才适用。换句话说,任何编写查询的人都需要知道索引定义的详细信息,否则就不会使用该索引。

I found this surprising, and a real bummer. If I understand correctly a comment from jjanes on another question, the planner only recognizes that the index applies when your query statement uses exactly the same expression. In other words, anyone writing a query needs to know the details of the index definition or else the index won't be used.

我一直认为,当构建表达索引时,缩写/提取/等。值已存储,计划者将对其进行检查。 除了重述整个表达式之外,还有什么其他方法可以暗示计划者吗?索引具有正确的数据,但是计划者似乎跳过了它。

I had been assuming that when the expression index was constructed, the abbreviated/extracted/etc. value was stored and that the planner would check it. Is there any way to hint the planner other than to recapitulate the entire expression? The index has the right data, but the planner seems to skip over it.

我根据Erwin Brandstetter的答案添加了一些细节:

I'm adding a bit of detail based on Erwin Brandstetter's answer:

我有很多类似情况的 ,这就是为什么我在这里详细介绍。在这种情况下,我的约800万行中,只有6个值的长度大于2172个字符,而99.93%的值是100个字符或更少。

I have lots of similar situations, which is why I'm digging in here on the details. In this case, of my ~8M rows, only 6 have values longer than 2172 characters, and 99.93% of the values are 100 characters or less.

我是希望这是一种易于他人接受的方法。阴影场很可能是答案,因为必须知道索引构造的确切详细信息会给我一种完全错误的可见性。一旦知道要使用它,阴影场就不会受到这个问题的困扰。正如您提到的,我可以用LEFT(old_field,128)或其他一些长度或texthash(old_field)填充它。我将对此进行试验。我的数据太短了,以至于散列似乎导致很高的冲突率。

What I'm hoping for is an approach that is easy for someone else to pick up. A shadow-field might well be the answer as having to know the exact details of index constructions strikes me as exactly the wrong sort of visibility. A shadow field doesn't suffer from that problem, once you know to use it. I could either populate it with LEFT(old_field,128) or some other length, or texthash(old_field), as you mentioned. I'll experiment with that. My data is so skewed to short values, that hashing seems to lead to a high collision rate.

对于我来说,这支球队和我来自一个系统,在B树中建立索引时,文本字段将被静默修剪为1024个字符。对用户完全透明,搜索参考索引。我知道苹果和火花塞。关键是我不希望Postgres成为AI,但是我 am 的先验信息不正确。因此,感谢您和其他所有人帮助我更多地了解Postgres的工作原理。

For what it's worth, the team and I are coming from a system where text fields are silently trimmed to 1024 characters when indexed in a B-tree. It's completely transparent to the user and searches consult the index. Apples and spark plugs, I know. The point is that I'm not expecting Postgres to be an AI, but I am coming with inaccurate priors. So thanks to you and everyone else for helping me learn more about how Postgres actually works.


此问题已得到解答,但我想为档案添加一些后续内容。我一直在从一些古老的答案中学习很多。因此,这里有一些将来的信息。我尝试了四种解决方案:

This question has been answered, but I want to add some follow-up for the archives. I've been learning a lot from old answers, some very old. So here's a bit of information for the future. I tried out four solutions:


  • 在部分citext字段中使用B树。

  • citext字段的哈希值上的B树。

  • citext字段的哈希索引。

  • citext字段的Tri-gram GIN索引。

  • B-tree on a portion of the citext field.
  • B-tree on a hash of the citext field.
  • Hash index of the citext field.
  • Tri-gram GIN index of the citext field.

由于似乎没有任何方法可以在citext上获得LIKE类型的查询,而文本可能太长,因此目标是为=创建索引。以上三个中的任何一个都可以,但是它们之间有很大的不同。以下是该测试的一些设置代码:

Since there doesn't appear to be any way to get LIKE-type queries on citext where the text may be too long, the goal is to create an index for =. Any of the three above will work okay, but they differ quite a bit. Here's some setup code for the test:

DROP INDEX IF EXISTS record_changes_log_detail_old_value_ix_btree;
CREATE INDEX record_changes_log_detail_old_value_ix_btree
    ON record_changes_log_detail
    USING btree ((left(old_value,1024)::citext) citext_pattern_ops);


DROP INDEX IF EXISTS record_changes_log_detail_old_value_hash_ix_btree;
CREATE INDEX record_changes_log_detail_old_value_hash_ix_btree
    ON record_changes_log_detail
    USING btree (hashtext(old_value));

DROP INDEX IF EXISTS record_changes_log_detail_old_value_ix_hash;
CREATE INDEX record_changes_log_detail_old_value_ix_hash
    ON record_changes_log_detail
    USING hash (old_value);    

DROP INDEX IF EXISTS record_changes_log_detail_old_value_ix_tgrm;
CREATE INDEX record_changes_log_detail_old_value_ix_tgrm
    ON record_changes_log_detail 
    USING gin (old_value gin_trgm_ops);

VACUUM ANALYZE;

这些索引每个都可以找到记录,但是语法不同:

These indexes each work to find a record, but with different syntax:

-- Uses the LEFT()::citext index
explain analyze
select * from record_changes_log_detail 
where left(old_value,1024)::citext = 'Gold Kerrison Neuro';

-- Uses the HASH index
explain analyze
select * from record_changes_log_detail 
where old_value = 'Gold Kerrison Neuro';

-- Uses the HASHTEXT() index
explain analyze
select * from record_changes_log_detail 
where hashtext(old_value) = hashtext('Gold Kerrison Neuro');

-- Uses the tri-gram() index
explain analyze
select * from record_changes_log_detail 
where old_value::text LIKE '%Gold Kerrison Neuro%';

哈希索引提供了最好的语法,因为它是透明的...但是哈希索引在其他方式。这是尺码搜索和结果。我在这里手动添加了报告的索引建立时间。

The hash index provides the nicest syntax because it's transparent...but the hash index is the worst in every other way. Here's a size search and results. I've added reported index building times manually here.

select
'B-tree on LEFT(old_value,1024)::citext' as index_description,
pg_size_pretty(pg_relation_size ('record_changes_log_detail_old_value_ix_btree')) as pretty

union all

select
'B-tree on HASHTEXT(old_value)' as index_description,
pg_size_pretty(pg_relation_size ('record_changes_log_detail_old_value_hash_ix_btree')) as pretty

union all

select
'Hash index on old_value' as index_description,
pg_size_pretty(pg_relation_size ('record_changes_log_detail_old_value_ix_hash')) as pretty

union all

select
'GIN tri-gram index on old_value' as index_description,
pg_size_pretty(pg_relation_size ('record_changes_log_detail_old_value_ix_tgrm')) as pretty;


index_description                       pretty  seconds
B-tree on LEFT(old_value,1024)::citext  238 MB       38
B-tree on HASHTEXT(old_value)           166 MB        7
Hash index on old_value                 362 MB    3,802
GIN tri-gram index on old_value         106 MB       56

我会说此数据与哈希索引非常匹配,因此请不要将这些结果视为典型结果。尽管如此,时间和大小还是很糟糕的。 =搜索的明显赢家是Erwin Brandstetter聪明地建议B树哈希。真好!搜索所需的额外语法糖在这里不如基于LEFT的索引那么糟糕。展望未来,这将从PG 12中承诺的B树改进中受益。

I'd say this data is a terrible match for a hash index, so please don't take those results as typical. Still, the time and size are pretty bad. The clear winner for = searches is Erwin Brandstetter's clever suggestion to B-tree a hash. Nice! The extra syntactic sugar needed for the search isn't as bad here as for the LEFT-based index. Looking forward, this will benefit from the B-tree improvements promised in PG 12.

还有更多好消息,三叉戟索引很棒!。劳伦兹·阿尔伯(Laurenz Albe)建议尝试一下,我很高兴自己做了。即时包含/类似搜索,完美。这就是我所需要的。再一次,我怀疑索引大小是否是典型的...我的数据很奇怪。对于使用citext的用户,请注意,必须将搜索条件转换为要使用的索引的文本:

And more good news, the tri-gram index is awesome. Laurenz Albe suggested trying it out, and I'm happy I did. Instantaneous contains/like searches, perfect. That's just what I need. Here again, I doubt that the index size is typical...my data is weird. For those using citext, note that you have to cast the search condition to text for the index to be used:

select * from record_changes_log_detail 
where old_value::text LIKE '%Gold Kerrison Neuro%';

对于那些不知道的人,三元组是长度为n个元组的一个实例N-gram有时称为q-gram或k-gram。无论如何,都是一样的。在所有幼稚的(非概率性或统计性)模糊文本匹配算法中,它可能是最好的。强大,灵活,功能强大的各种数据集和语言。因此,我对它在Postgres中的工作效果感到非常满意。

For those who don't know, tri-grams are an instance of n-grams with a length of 3. N-grams are sometimes called q-grams or k-grams. Whatever, it's the same thing. Of all of the naive (non-probabilistic or statistical) fuzzy text matching algorithms, it's probably the best. Robust over different data sets and languages, flexible, awesome. So I'm super pleased at how well it works in Postgres.

推荐答案

就像您从其他地方的jjanes中读到的那样:仅当表达式在查询谓词中完全匹配时才考虑使用index。 Postgres查询计划程序不是AI。如果计划的时间太长,很快就会使快速查询的目的失效。

It's just like you read from jjanes elsewhere: an expression index is only considered if the expression is matched in the query predicate exactly. The Postgres query planner is not an AI. It would quickly defeat the purpose of making queries fast if planning them takes too long.

如果可以的话,您可以稍微优化一下索引。 left() substring()更简单,更快:

You can optimize your index a bit, if that's any consolation. left() is simpler and faster than substring():

CREATE INDEX record_changes_log_detail_old_value_ix_btree
ON record_changes_log_detail (left(old_value,1024) text_pattern_ops);

btree索引的最大行大小为2704字节 ,而不是 B树的2172个字符限制。

Also, there is a maximum row size of 2704 bytes for btree indexes, not a "2172 character limit on B-trees".

最重要的是,仅用于相等性检查(如您的问题所示)使用 md5(old_value) hashtext(old_value)在哈希值上的索引将是很多更高效。如果这样做,请记住像这样防御哈希冲突

Most importantly, for only equality checks, like your question suggests, a btree index on a hash value using md5(old_value) or hashtext(old_value) would be much more efficient. If you do, remember to defend against hash collisions like so:

SELECT *
FROM   record_changes_log_detail 
WHERE  hashtext(old_value) = hashtext('Gold Kerrison Neuro')
AND    old_value = 'Gold Kerrison Neuro';

第一个谓词使您可以快速访问索引。第二个排除误报。碰撞应该非常罕见。但是可能。

The first predicate gives you fast index access. The second excludes false positives. Collisions should be extremely rare. But possible. And the possibility grows with the size of the table.

相关:

  • SELECT query with DISTINCT on a table-structure for graphs is very slow
  • What is the optimal data type for an MD5 field?
  • Full-text search in CouchDB

哈希索引就像您已经考虑过自己一样:

Or a hash index like you have been considering yourself already:

  • Why is a Postgres 11 hash index so large?

(这里您不需要担心哈希冲突;在内部处理。)

(Here you don't need to worry about hash collisions; handled internally.)

这篇关于搜索表达索引的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆