How to improve single-character PrefixQuery performance?

Problem description

I have a RAMDirectory with 1.5 million documents, and I'm searching a single field with a PrefixQuery. When the search text is 3 or more characters long, the search is extremely fast: under 20 milliseconds. When the search text is shorter than 3 characters, the search can take a full second.

Since this is an autocomplete feature and the user starts with one character (and some results really are 1 character long), I cannot restrict the length of the search text.

The code is roughly:

var symbolCodeTopDocs = searcher.Search(new PrefixQuery(new Term("SymbolCode", searchText)), 10);

SymbolCode is a NOT_ANALYZED field. The Lucene.NET version is 3.0.3.

The example is simplified; in a real-world scenario I might have to use a BooleanQuery to apply additional constraints.

How can I improve performance in this specific case? These one- and two-character queries are bringing the server down.

Recommended answer

Consider removing stop words from your index if you haven't already.

To understand how stop words slow down a PrefixQuery, consider how PrefixQuery works: it is rewritten as a BooleanQuery that includes every term in the index beginning with the PrefixQuery's term. For example, a* becomes a OR and OR aardvark OR anchor OR ... So far this isn't bad, and it performs surprisingly well even with thousands of terms. The real drain comes when stop words like "a" and "and" are included, because they are likely to appear multiple times in every single document in your index. That creates far more work for the gathering/collecting/scoring portion of the search and slows things down.
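The expansion described above can be sketched without Lucene at all. The following is a minimal illustration, assuming a toy in-memory inverted index; the terms, document counts, and the helper names ExpandPrefix and PostingsVisited are all made up for this example and are not part of the Lucene.NET API. It only shows why the number of postings visited, rather than the number of expanded terms, dominates the cost.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy inverted index: term -> IDs of the documents containing it.
// All terms and postings here are invented illustration data.
var index = new SortedDictionary<string, int[]>(StringComparer.Ordinal)
{
    ["a"]        = Enumerable.Range(0, 1000).ToArray(), // stop word: in every document
    ["aardvark"] = new[] { 3, 17 },
    ["anchor"]   = new[] { 8, 42, 99 },
    ["and"]      = Enumerable.Range(0, 1000).ToArray(), // stop word: in every document
    ["bear"]     = new[] { 5 },
};

// The same index with the stop words left out.
var stopWords = new HashSet<string> { "a", "and" };
var pruned = new SortedDictionary<string, int[]>(
    index.Where(kv => !stopWords.Contains(kv.Key))
         .ToDictionary(kv => kv.Key, kv => kv.Value),
    StringComparer.Ordinal);

// Expand a prefix into every indexed term that starts with it,
// the way a* becomes a OR aardvark OR anchor OR and.
List<string> ExpandPrefix(SortedDictionary<string, int[]> idx, string prefix) =>
    idx.Keys.Where(t => t.StartsWith(prefix, StringComparison.Ordinal)).ToList();

// Collecting hits costs roughly one step per posting of every expanded term.
int PostingsVisited(SortedDictionary<string, int[]> idx, string prefix) =>
    ExpandPrefix(idx, prefix).Sum(t => idx[t].Length);

Console.WriteLine(string.Join(" OR ", ExpandPrefix(index, "a"))); // a OR aardvark OR anchor OR and
Console.WriteLine(PostingsVisited(index, "a"));  // 2005
Console.WriteLine(PostingsVisited(pruned, "a")); // 5
```

In this toy data, dropping two stop words removes only two of the four expanded terms but almost all of the postings, which is the effect the answer describes.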

On a side note, I highly recommend not running the autocomplete search when the user has entered fewer than 2 or 3 characters, purely from a usability perspective. I can't imagine the results would be at all relevant. Imagine running a search for a*: there's no way to tell which results are more relevant. If you must display something to the user, consider an n-gram approach like the one Jf Beaulac suggested in the comments.
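The comment itself is not reproduced here, so what follows is only one plausible reading of the n-gram approach: precompute "edge n-grams" (leading substrings) of each value at index time, so that a short prefix becomes a single exact lookup instead of a term expansion. This is a rough, Lucene-free sketch; the sample symbol codes, the MaxGram cutoff, and the helpers EdgeNGrams and Suggest are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical symbol codes; a real index would hold the SymbolCode values.
var symbolCodes = new[] { "A", "AA", "AAPL", "AMZN", "MSFT" };

// Assumed cutoff: only 1- and 2-character prefixes are precomputed, because
// the question reports that prefixes of 3 or more characters are already fast.
const int MaxGram = 2;

// Edge n-grams: the leading substrings of a value, up to maxGram characters.
IEnumerable<string> EdgeNGrams(string value, int maxGram) =>
    Enumerable.Range(1, Math.Min(maxGram, value.Length))
              .Select(len => value.Substring(0, len));

// Index time: map each short prefix to the codes that start with it.
var prefixIndex = new Dictionary<string, List<string>>(StringComparer.Ordinal);
foreach (var code in symbolCodes)
{
    foreach (var gram in EdgeNGrams(code, MaxGram))
    {
        if (!prefixIndex.TryGetValue(gram, out var bucket))
            prefixIndex[gram] = bucket = new List<string>();
        bucket.Add(code);
    }
}

// Query time: a short prefix is one exact lookup, with no term expansion.
List<string> Suggest(string prefix, int limit) =>
    prefixIndex.TryGetValue(prefix, out var hits)
        ? hits.Take(limit).ToList()
        : new List<string>();

Console.WriteLine(string.Join(", ", Suggest("A", 10)));  // A, AA, AAPL, AMZN
Console.WriteLine(string.Join(", ", Suggest("AA", 10))); // AA, AAPL
```

In a Lucene index, the equivalent would presumably be an extra NOT_ANALYZED field holding these grams (the field name would be your choice), queried with a TermQuery for 1- and 2-character input while the existing PrefixQuery continues to handle longer input.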
