加快通配符文本查找 [英] speeding up wildcard text lookups

查看:90
本文介绍了加快通配符文本查找的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在Postgres中有一个简单的表,其中有超过800万行.感兴趣的列包含短文本字符串,通常一个或多个单词的总长度小于100个字符.设置为字符变化(100)".该列已建立索引.像下面这样简单的查找需要花费3000毫秒以上的时间.

I have a simple table in Postgres with a bit over 8 million rows. The column of interest holds short text strings, typically one or more words total length less than 100 characters. It is set as 'character varying (100)'. The column is indexed. A simple look up like below takes > 3000 ms.

SELECT a, b, c FROM t WHERE a LIKE '?%'

是的,目前,仅需要简单地查找以输入的文本开头的"a"行.我想将查找速度降低到100毫秒以下(瞬时出现).有什么建议吗?在我看来,全文搜索在这里无济于事,因为我的文字栏太短了,但是如果值得的话,我很乐意尝试一下.

Yes, for now, the need is to simply find the rows where "a" starts with the entered text. I want to bring the speed of look up down to under 100 ms (the appearance of instantaneous). Suggestions? Seems to me that full text search won't help here as my column of text is too short, but I would be happy to try that if worthwhile.

哦,顺便说一句,我还在mongodb和索引列"a"中加载了完全相同的数据.在mongodb中加载数据的速度非常快(mongodb ++).执行精确查找时,mongodb和Postgres几乎都是瞬时的.但是,Postgres在进行上述的尾随通配符搜索时实际上会发光,持续花费的时间大约是mongodb的1/3.如果可以加快速度,我将很乐意追求mongodb,因为这只是一个只读操作.

Oh, btw I also loaded the exact same data in mongodb and indexed column "a". Loading the data in mongodb was amazingly quick (mongodb++). Both mongodb and Postgres are pretty much instantaneous when doing exact lookups. But, Postgres actually shines when doing trailing wildcard searches as above, consistently taking about 1/3 as long as mongodb. I would be happy to pursue mongodb if I could speed that up as this is only a readonly operation.

更新:首先,几个EXPLAIN ANALYZE输出

EXPLAIN ANALYZE SELECT a, b, c FROM t WHERE a LIKE 'abcd%'

"Seq Scan on t  (cost=0.00..282075.55 rows=802 width=40) 
    (actual time=1220.132..1220.132 rows=0 loops=1)"
"  Filter: ((a)::text ~~ 'abcd%'::text)"
"Total runtime: 1220.153 ms"

我实际上想将Lower(a)与总是至少4个字符长的搜索词进行比较,所以

I actually want to compare Lower(a) with the search term which is always at least 4 characters long, so

EXPLAIN ANALYZE SELECT a, b, c FROM t WHERE Lower(a) LIKE 'abcd%'

"Seq Scan on t  (cost=0.00..302680.04 rows=40612 width=40) 
    (actual time=4.681..3321.387 rows=788 loops=1)"
"  Filter: (lower((a)::text) ~~ 'abcd%'::text)"
"Total runtime: 3321.504 ms"

所以我创建了一个索引

CREATE INDEX idx_t ON t USING btree (Lower(Substring(a, 1, 4) ));

"Seq Scan on t  (cost=0.00..302680.04 rows=40612 width=40) 
    (actual time=3243.841..3243.841 rows=0 loops=1)"
"  Filter: (lower((a)::text) = 'abcd%'::text)"
"Total runtime: 3243.860 ms"

似乎唯一使用索引的时间是在我寻找完全匹配的时候

Seems the only time an index is being used is when I am looking for an exact match

EXPLAIN ANALYZE SELECT a, b, c FROM t WHERE a = 'abcd'

"Index Scan using idx_t on geonames  (cost=0.00..57.89 rows=13 width=40) 
    (actual time=40.831..40.923 rows=17 loops=1)"
"  Index Cond: ((ascii_name)::text = 'Abcd'::text)"
"Total runtime: 40.940 ms"

通过使用varchar_pattern_ops实现索引来找到解决方案现在正在寻找更快的查询.

推荐答案

PostgreSQL查询计划器很聪明,但不是AI.要使其在表达式上使用索引,请在查询中使用完全相同的表达式形式.

The PostgreSQL query planner is smart, but not an AI. To make it use an index on an expression use the exact same form of expression in the query.

具有这样的索引:

CREATE INDEX t_a_lower_idx ON t (lower(substring(a, 1, 4)));

或更简单的PostgreSQL 9.1:

Or simpler in PostgreSQL 9.1:

CREATE INDEX t_a_lower_idx ON t (lower(left(a, 4)));

使用此查询:

SELECT * FROM t WHERE lower(left(a, 4)) = 'abcd';

在功能上100%等同于:

Which is 100% functionally equivalent to:

SELECT * FROM t WHERE lower(a) LIKE 'abcd%'

或者:

SELECT * FROM t WHERE a ILIKE 'abcd%'


但是不是:

SELECT * FROM t WHERE a LIKE 'abcd%'

这是一个功能上不同的查询,您需要一个不同的索引:

This is a functionally different query and you need a different index:

CREATE INDEX t_a_idx ON t (substring(a, 1, 4));

或更简单的PostgreSQL 9.1:

Or simpler with PostgreSQL 9.1:

CREATE INDEX t_a_idx ON t (left(a, 4));

并使用以下查询:

SELECT * FROM t WHERE left(a, 4) = 'abcd';


可变长度的左锚搜索词

不区分大小写.索引:


Left anchored search terms of variable length

Case insensitive. Index:

编辑:几乎忘记了:如果使用默认'C'以外的任何其他语言环境运行数据库,则需要

Edit: Almost forgot: If you run your db with any other locale than the default 'C', you need to specify the operator class explicitly - text_pattern_ops in my example:

CREATE INDEX t_a_lower_idx
ON t (lower(left(a, <insert_max_length>)) text_pattern_ops);

查询:

SELECT * FROM t WHERE lower(left(a, <insert_max_length>)) ~~ 'abcdef%';

可以利用索引,并且几乎与固定长度的变体一样快.

Can utilize the index and is almost as fast as the variant with a fixed length.

您可能对此dba.SE上的帖子感兴趣,特别是有关模式匹配的详细信息关于运算符~>=~~<~的部分.

You may be interested in this post on dba.SE with more details about pattern matching, especially the last part about the operators ~>=~ and ~<~.

这篇关于加快通配符文本查找的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆