How to improve a single character PrefixQuery performance?

Views: 24  Published: 2022/1/15 13:03:18  Tags: lucene, lucene.net


Question

I have a RAMDirectory with 1.5 million documents, and I'm searching a single field with a PrefixQuery. When the search text is 3 or more characters long, the search is extremely fast: less than 20 milliseconds. But when the search text is shorter than 3 characters, the search can take a full second.

Since this is an autocomplete feature and the user starts with one character (and there are results that are indeed 1 character long), I cannot restrict the length of the search text.

The code is pretty much:

var symbolCodeTopDocs = searcher.Search(new PrefixQuery(new Term("SymbolCode", searchText)), 10);

SymbolCode is a NOT_ANALYZED field. The Lucene.NET version is 3.0.3.

The example is simplified; in a real-world scenario I might have to use a BooleanQuery to apply additional constraints.

How can I improve performance in this specific case? These single-character and two-character queries are bringing the server down.

Recommended answer

Consider removing stop words from your index if you haven't already.

To understand how stop words slow down a PrefixQuery, consider how PrefixQuery works: it is rewritten as a BooleanQuery that includes every term in the index beginning with the PrefixQuery's term. For example, a* becomes a OR and OR aardvark OR anchor OR ... So far this isn't bad, and it performs surprisingly well even with thousands of terms. The real drain comes when stop words like "a" and "and" are included, because they are likely to appear multiple times in every single document in your index. This creates much more work for the collecting and scoring portion of the search, and thus slows things down.
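The expansion described above can be modelled without Lucene at all. The following is only a toy sketch, not Lucene.NET's actual implementation: the class name, the term dictionary, and the document counts are all made up for illustration. It shows how a shorter prefix pulls in more terms, and how a couple of very frequent terms dominate the number of postings that have to be walked.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy model of prefix expansion; NOT Lucene.NET's actual implementation.
public static class PrefixExpansionSketch
{
    // Hypothetical term dictionary: term -> number of documents containing it.
    public static readonly SortedDictionary<string, int> DocFreq =
        new SortedDictionary<string, int>(StringComparer.Ordinal)
        {
            { "a", 900 }, { "aardvark", 2 }, { "anchor", 5 },
            { "and", 950 }, { "apple", 7 }, { "banana", 4 }
        };

    // Every indexed term that starts with the prefix, mirroring how
    // a* becomes "a OR aardvark OR anchor OR and OR ...".
    public static List<string> Expand(string prefix)
    {
        return DocFreq.Keys
            .Where(t => t.StartsWith(prefix, StringComparison.Ordinal))
            .ToList();
    }

    // Total postings the search has to walk for the expanded terms.
    public static int PostingsToVisit(string prefix)
    {
        return Expand(prefix).Sum(t => DocFreq[t]);
    }

    public static void Main()
    {
        foreach (var prefix in new[] { "a", "an", "anc" })
        {
            Console.WriteLine("{0}* -> {1} terms, {2} postings",
                prefix, Expand(prefix).Count, PostingsToVisit(prefix));
        }
    }
}
```

With this made-up data, a* expands to five terms and 1,864 postings, while anc* expands to one term and five postings. Almost all of the work for a* comes from the two high-frequency terms.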

On a side note, I highly recommend not running the autocomplete search when the user has entered fewer than 2 or 3 characters, purely from a usability perspective. I can't imagine the results would be at all relevant. Imagine running a search for a*: there is no way to tell which results are more relevant. If you must display something to the user, consider an n-gram approach, as Jf Beaulac suggested in the comments.
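The comment with the n-gram suggestion is not reproduced here, so the following is only one possible reading of the idea: precompute the leading 1- and 2-character substrings (edge n-grams) of each symbol code, so that a short input becomes a single exact lookup instead of a prefix expansion. The class, method names, and sample codes are hypothetical. In a Lucene index, the same grams could presumably be stored in an extra not-analyzed field and matched with a single TermQuery; that part is an assumption and is not shown.

```csharp
using System;
using System.Collections.Generic;

// Sketch of an edge n-gram lookup table; all names here are hypothetical.
public static class EdgeNGramSketch
{
    // Leading substrings of length 1..maxLength, e.g. "MSFT" -> "M", "MS".
    public static IEnumerable<string> EdgeNGrams(string value, int maxLength)
    {
        int limit = Math.Min(maxLength, value.Length);
        for (int length = 1; length <= limit; length++)
        {
            yield return value.Substring(0, length);
        }
    }

    // Maps each short prefix to at most perPrefix symbol codes.
    // Built once, at indexing time, in the order the codes are supplied.
    public static Dictionary<string, List<string>> Build(
        IEnumerable<string> symbolCodes, int maxLength, int perPrefix)
    {
        var table = new Dictionary<string, List<string>>(StringComparer.Ordinal);
        foreach (var code in symbolCodes)
        {
            foreach (var gram in EdgeNGrams(code, maxLength))
            {
                List<string> bucket;
                if (!table.TryGetValue(gram, out bucket))
                {
                    bucket = new List<string>();
                    table[gram] = bucket;
                }
                if (bucket.Count < perPrefix)
                {
                    bucket.Add(code);
                }
            }
        }
        return table;
    }

    public static void Main()
    {
        var table = Build(new[] { "A", "AA", "AAPL", "MS", "MSFT" }, 2, 10);
        Console.WriteLine(string.Join(", ", table["A"]));
        Console.WriteLine(string.Join(", ", table["MS"]));
    }
}
```

Because each bucket is capped and filled in input order, the codes would need to be supplied in the order they should be suggested (for example, most popular first). Inputs of 3 or more characters can keep using the existing PrefixQuery, which the question reports as already fast.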

