Understanding Lucene leading wildcard performance


Question

By default, Lucene does not allow leading wildcards in search terms, but they can be enabled with:

QueryParser#setAllowLeadingWildcard(true)

I understand that a leading wildcard prevents Lucene from using the index, so a search with a leading wildcard must scan the entire index.
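To make the "must scan" intuition concrete, here is a minimal sketch in Java (Lucene's own language) that models a field's term dictionary as a sorted set. This is a simplified model, not Lucene's actual implementation, and the class and method names (`TermDictionarySketch`, `prefixMatches`, `suffixMatches`) are hypothetical. A trailing wildcard can seek to the prefix and read a contiguous range, while a leading wildcard has to test every term. Note that the scan in this model is over distinct terms, not over documents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical model of a sorted term dictionary; not Lucene code.
class TermDictionarySketch {
    // Number of terms examined by the most recent lookup (illustration only).
    static int termsVisited;

    // Trailing wildcard "prefix*": seek to the prefix, then read until terms stop matching.
    static List<String> prefixMatches(NavigableSet<String> terms, String prefix) {
        termsVisited = 0;
        List<String> out = new ArrayList<>();
        for (String term : terms.tailSet(prefix, true)) {
            termsVisited++;
            if (!term.startsWith(prefix)) {
                break; // sorted order guarantees no later term can match
            }
            out.add(term);
        }
        return out;
    }

    // Leading wildcard "*suffix": sort order does not help, so every term is tested.
    static List<String> suffixMatches(NavigableSet<String> terms, String suffix) {
        termsVisited = 0;
        List<String> out = new ArrayList<>();
        for (String term : terms) {
            termsVisited++;
            if (term.endsWith(suffix)) {
                out.add(term);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        NavigableSet<String> terms = new TreeSet<>(Arrays.asList(
                "bring", "index", "ring", "search", "sing", "singer", "term"));
        System.out.println(prefixMatches(terms, "sing") + " after visiting " + termsVisited + " terms");
        System.out.println(suffixMatches(terms, "ing") + " after visiting " + termsVisited + " terms");
    }
}
```

In this toy dictionary the prefix lookup stops after three terms, while the suffix lookup visits all seven. How expensive the full scan is in practice depends on how many distinct terms the field holds and whether they are already in memory.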

How can I demonstrate the performance cost of leading wildcard queries? When is it OK to use setAllowLeadingWildcard(true)?

I have built a test index with 10 million documents of the form:

{ name: random_3_word_phrase }

The index is 360 MB on disk.

My test queries perform well, and I have been unable to actually demonstrate a performance problem. For example, querying for name:*ing returns over 1.1 million documents in less than 1 second. Querying for name:*ing* returns over 1.5 million documents in the same time.

What is going on here? Why isn't this slow? Is 10,000,000 documents not enough? Do the documents need to contain more than a single field?

Recommended answer

It depends on how much memory you have, and on how much of the token index is in memory.

A 360MB total index can be searched quite quickly on any old computer. A 360GB index would take a bit longer... ;)

As an example, I fired up an old 2GB index and searched for "*e".

On a box with 8GB of memory, it returned 500K hits in under 5 seconds. I tried the same index on a box with only 1GB of memory, and it took about 20 seconds.

To illustrate further, here's some generic C# code that basically does a "*E*" type search over 10 million random 3-word phrases.

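using System;
using System.Collections.Generic;
using System.Text;

class Program
{
    // The substring to look for, i.e. a "*E*"-style search.
    static string substring = "E";

    private static Random random = new Random((int)DateTime.Now.Ticks);//thanks to McAden

    // Builds a string of `size` random uppercase letters (A-Z).
    private static string RandomString(int size)
    {
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < size; i++)
        {
            char ch = Convert.ToChar(Convert.ToInt32(Math.Floor(26 * random.NextDouble() + 65)));
            builder.Append(ch);
        }

        return builder.ToString();
    }

    static void FindSubStringInPhrases()
    {
        // The in-memory "index": 10 million random 3-word phrases.
        List<string> index = new List<string>();

        for (int i = 0; i < 10000000; i++)
        {
            index.Add(RandomString(5) + " " + RandomString(5) + " " + RandomString(5));
        }

        // Linear scan: every phrase is tested against the predicate.
        var matches = index.FindAll(SubstringPredicate);
        Console.WriteLine(matches.Count);
    }

    // True when the phrase contains the search substring anywhere.
    static bool SubstringPredicate(string item)
    {
        return item.Contains(substring);
    }

    static void Main()
    {
        FindSubStringInPhrases();
    }
}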

After all 10 million phrases have been loaded into the list, it still takes only about a second for "var matches = index.FindAll(SubstringPredicate);" to return over 4 million hits.
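As a sanity check on the "over 4 million hits" figure, the following Java sketch estimates how often such a phrase contains an "E". It assumes, as the C# code does, that each phrase consists of 15 letters drawn uniformly from A to Z, so the chance of at least one "E" is 1 - (25/26)^15, or roughly 0.44. The class and method names are hypothetical, and the simulation uses far fewer phrases than the original test.

```java
import java.util.Random;

// Hypothetical helper that estimates the hit rate of the "*E*" search.
class SubstringHitRate {
    // Probability that `letters` uniform random A-Z characters contain a given letter at least once.
    static double expectedHitRate(int letters) {
        return 1.0 - Math.pow(25.0 / 26.0, letters);
    }

    // Builds one phrase of three 5-letter uppercase words, mirroring the C# RandomString calls.
    static String randomPhrase(Random rnd) {
        StringBuilder sb = new StringBuilder();
        for (int word = 0; word < 3; word++) {
            if (word > 0) {
                sb.append(' ');
            }
            for (int i = 0; i < 5; i++) {
                sb.append((char) ('A' + rnd.nextInt(26)));
            }
        }
        return sb.toString();
    }

    // Fraction of `samples` random phrases that contain "E".
    static double simulatedHitRate(int samples, long seed) {
        Random rnd = new Random(seed);
        int hits = 0;
        for (int i = 0; i < samples; i++) {
            if (randomPhrase(rnd).contains("E")) {
                hits++;
            }
        }
        return (double) hits / samples;
    }

    public static void main(String[] args) {
        System.out.printf("expected:  %.4f%n", expectedHitRate(15));
        System.out.printf("simulated: %.4f%n", simulatedHitRate(200_000, 42L));
    }
}
```

A hit rate near 0.44 over 10 million phrases is a bit more than 4 million matches, which agrees with the number reported above.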

The point is, memory is fast. You will start to see performance hits once things no longer fit into memory and you have to start swapping to disk.
