Speeding up a postfix wildcard search even more

Question

I recently asked a question about speeding up postfix wildcard text lookups such as SELECT a, b, c FROM t WHERE a LIKE 'abcde%' in Pg. Finally, by implementing the following index, I am able to get between 200 ms and 800 ms per query.

CREATE INDEX idxa ON t (Lower(a) varchar_pattern_ops);

I am now interested in speeding up the query by an order of magnitude, if possible; perhaps between 200-800 microseconds. Could this be done?

The entire table is about 1 GB of raw text (~8 million+ rows), and can be made even smaller, so it could easily fit in memory. Could I implement a cache on top of Pg, a cache that would seed over time? Perhaps memcached or something else. Since most caches have an exact key lookup, how would I do a wildcard search from a cache?

Btw, as a point of info, I did load the entire table into MongoDB, and while I got very fast lookups on exact searches such as a = 'abcdefg', MongoDB's wildcard search as above was actually inferior to that of Postgres.

Recommended answer

You can still squeeze out some more.

Firstly, I would generally advise using the data type text instead of varchar, and therefore text_pattern_ops instead of varchar_pattern_ops. This won't affect performance though.

Next, as your column has up to 100 characters, but you only use the first n (20?) characters, the index will be much smaller with lower(left(a, 20)) instead of lower(a), as I already suggested in my answer to your prequel question.

With an index on the full column, the index search itself performs the same, but the server has to visit many more pages on disk or in RAM: fewer index entries fit per page, so more pages have to be visited for every lookup. Pages will also drop out of your cache sooner. This is especially important with big tables like yours. Limit the range of letters one can search for to the required minimum. This leaves you with something like:

CREATE INDEX t_a_lower_left_idx ON t (lower(left(a, 20)) text_pattern_ops);
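
To get a feel for the size difference, here is a rough back-of-the-envelope sketch in Python. It is not a measurement. Every constant is an assumption: the default 8 kB page size, approximate per-page and per-entry overheads, the default B-tree fillfactor of 90, single-byte characters, every key at its maximum length, and no deduplication. The names entries_per_page and leaf_pages are made up for this illustration.

```python
# Rough worst-case estimate of B-tree leaf pages. All constants are assumptions.
PAGE_SIZE = 8192          # default PostgreSQL block size in bytes
PAGE_OVERHEAD = 24 + 16   # page header + B-tree special space (approximate)
FILLFACTOR_PCT = 90       # default fillfactor for B-tree leaf pages


def align8(n):
    """Round n up to the next multiple of 8 (typical tuple alignment)."""
    return (n + 7) // 8 * 8


def entries_per_page(key_bytes):
    """Approximate number of index entries that fit on one leaf page."""
    # 8-byte index tuple header + 1-byte short varlena header + key,
    # padded to 8 bytes, plus a 4-byte line pointer per entry.
    tuple_bytes = align8(8 + 1 + key_bytes) + 4
    usable = (PAGE_SIZE - PAGE_OVERHEAD) * FILLFACTOR_PCT // 100
    return usable // tuple_bytes


def leaf_pages(rows, key_bytes):
    """Approximate number of leaf pages for `rows` entries (ceiling division)."""
    per_page = entries_per_page(key_bytes)
    return -(-rows // per_page)


if __name__ == "__main__":
    for key_bytes in (100, 20):
        print(key_bytes, entries_per_page(key_bytes), leaf_pages(8_000_000, key_bytes))
```

Under these assumptions, 8 million keys of 100 bytes need about three times as many leaf pages as 8 million keys of 20 bytes. Real numbers depend on the actual length distribution of the column, so measure the real index size if it matters.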






Also, you can use the special operators ~>=~ and ~<~ in your query, as I demonstrate in the answer I linked to (https://dba.stackexchange.com/a/10696/3684):

SELECT * FROM tbl WHERE lower(a) ~>=~ 'abcde' AND lower(a) ~<~ ('abcdf')

Note the 'f' instead of the 'e' in the second expression. Question is: how do you get the "next" character according to locale 'C'?

SELECT chr(ascii('e') + 1);

So you can:

SELECT * FROM tbl WHERE lower(a) ~>=~ 'abcde'
                    AND lower(a) ~<~ ('abcd' || chr(ascii('e')+1))
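
The same half-open range works outside the database too. It may be one way to answer prefix lookups from an in-memory structure instead of an exact-key cache. The following is a minimal Python sketch, assuming the keys are already lowercased and held in a sorted list. Python compares strings by code point, which is close in spirit to the 'C' locale ordering. The names prefix_bounds and prefix_search are invented for this example.

```python
from bisect import bisect_left


def prefix_bounds(prefix):
    """Return (low, high) such that low <= s < high for every s starting with prefix."""
    if not prefix:
        raise ValueError("prefix must not be empty")
    last = ord(prefix[-1])
    if last == 0x10FFFF:
        raise ValueError("last character has no successor")
    # Same idea as 'abcd' || chr(ascii('e') + 1): bump the last character by one.
    return prefix, prefix[:-1] + chr(last + 1)


def prefix_search(sorted_keys, prefix):
    """Return all keys in the sorted list that start with prefix."""
    low, high = prefix_bounds(prefix)
    start = bisect_left(sorted_keys, low)   # first key >= low
    stop = bisect_left(sorted_keys, high)   # first key >= high
    return sorted_keys[start:stop]


if __name__ == "__main__":
    keys = sorted(["abc", "abcd", "abcde", "abcdef", "abcdf", "abcdz"])
    print(prefix_bounds("abcde"))
    print(prefix_search(keys, "abcde"))
```

Two binary searches find the range, so the cost grows with the logarithm of the list size plus the number of matches, much like the index range scan above.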

I ran a test with a natural table holding half a million rows. A search term yielding 650 rows took 4 ms with the first query and 3 ms with the second. It very much depends on how many rows are found. A search term yielding only 1 row takes 0.044 ms here.

Therefore, limit the minimum length of the search term to prohibit useless queries that would yield too many rows anyway. Like 3 or 4 characters minimum.
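
On the application side, such a limit could be enforced before the query is ever sent. Here is a minimal Python sketch, not a definitive implementation. The function name build_prefix_query, the limits of 3 and 20, and the %s placeholder style (as used by psycopg) are assumptions. The WHERE clause uses lower(left(a, 20)) on the assumption that the expression has to match the index definition above for the index to be usable.

```python
MIN_TERM_LEN = 3    # assumed lower limit ("3 or 4 characters minimum")
MAX_TERM_LEN = 20   # assumed to match left(a, 20) in the index expression


def build_prefix_query(term):
    """Validate a search term and return (sql, params) for a prefix range query."""
    # Note: Python's lower() may differ from PostgreSQL's lower() for some characters.
    term = term.lower()
    if not MIN_TERM_LEN <= len(term) <= MAX_TERM_LEN:
        raise ValueError(
            f"search term must have {MIN_TERM_LEN} to {MAX_TERM_LEN} characters"
        )
    # Upper bound: same prefix with the last character bumped by one code point.
    upper = term[:-1] + chr(ord(term[-1]) + 1)
    sql = (
        "SELECT * FROM tbl "
        "WHERE lower(left(a, 20)) ~>=~ %s "
        "AND lower(left(a, 20)) ~<~ %s"
    )
    return sql, (term, upper)


if __name__ == "__main__":
    print(build_prefix_query("ABCDE"))
```

Passing the bounds as parameters rather than pasting them into the SQL string also avoids quoting problems with user input.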

Next, you can cluster your table like this:

CLUSTER tbl USING t_a_lower_left_idx

After that, my testcase took 2.5 ms instead of 3 ms.

Of course, all the basic advice on performance optimization applies.

If the above is not enough, you might want to think about creating a tablespace on a ramdisk or a tmpfs partition (Linux) and create indexes there or even put your whole table there. I am sure you are aware of the security implications of a volatile medium for a database. Only do this if you can afford losing all your data.

CREATE INDEX t_a_lower_left_idx ON t (lower(left(a, 20)) text_pattern_ops)
TABLESPACE indexspace;

If your database is set up properly and your machine has enough RAM and the table is read heavily, the standard caching algorithms may provide most of the performance gain automatically, and you won't gain much with this.
