Very slow highlight performance in Lucene


Problem description

The Lucene (4.6) highlighter is very slow when a frequent term is searched. The search itself is fast (100 ms), but highlighting may take more than an hour(!).

Details: a large text corpus was used (1.5 GB of plain text). Performance does not depend on whether the text is split into smaller pieces (tested with 500 MB and 5 MB pieces as well). Positions and offsets are stored. If a very frequent term or pattern is searched, TopDocs are retrieved quickly (100 ms), but each searcher.doc(id) call is expensive (5-50 s), and getBestFragments() is extremely expensive (more than an hour), even though the fields are stored and indexed for this purpose. (Hardware: Core i7, 8 GB RAM.)

Greater background: this serves a language-analysis research project. A special stemmer is used that also stores part-of-speech information. For example, if "adj adj adj adj noun" is searched, it returns all occurrences of that pattern in the text, with context.
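A pattern search like the one described could be expressed, for example, as a phrase query over the part-of-speech tags. This is a hypothetical sketch against the Lucene 4.6 API, not code from the question; it assumes the custom stemmer indexes POS tags as ordinary terms at consecutive positions in the "content" field. It cannot be asserted standalone because it requires the Lucene 4.6 jars.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

// Hypothetical: match five consecutive tokens tagged adj adj adj adj noun,
// assuming the analyzer emits the POS tag as the indexed term.
PhraseQuery posPattern = new PhraseQuery();
posPattern.add(new Term("content", "adj"));
posPattern.add(new Term("content", "adj"));
posPattern.add(new Term("content", "adj"));
posPattern.add(new Term("content", "adj"));
posPattern.add(new Term("content", "noun"));
// posPattern can now be passed to searcher.search(...) like any other Query.
```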

Can I tune its performance, or should I choose another tool?

Code used:

            // indexing: store the field and index positions + offsets,
            // plus term vectors so the highlighter can reuse them
            FieldType offsetsType = new FieldType(TextField.TYPE_STORED);
            offsetsType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
            offsetsType.setStored(true);
            offsetsType.setIndexed(true);
            offsetsType.setStoreTermVectors(true);
            offsetsType.setStoreTermVectorOffsets(true);
            offsetsType.setStoreTermVectorPositions(true);
            offsetsType.setStoreTermVectorPayloads(true);

            doc.add(new Field("content", fileContent, offsetsType));


            // querying
            TopDocs results = searcher.search(query, limitStart + limit);

            int endPos = Math.min(results.scoreDocs.length, limitStart + limit);
            int startPos = Math.min(results.scoreDocs.length, limitStart);

            for (int i = startPos; i < endPos; i++) {
                int id = results.scoreDocs[i].doc;

                // bottleneck #1 (5-50 s):
                Document doc = searcher.doc(id);

                FastVectorHighlighter h = new FastVectorHighlighter();

                // bottleneck #2 (more than 1 hour):
                // (reader is the IndexReader the searcher was opened on)
                String[] hs = h.getBestFragments(h.getFieldQuery(query), reader,
                        id, "content", contextSize, 10000);
            }

Related (unanswered) question: https://stackoverflow.com/questions/19416804/very-slow-solr-performance-when-highlighting

Answer

getBestFragments relies on the tokenization done by the analyzer you are using. If you have to analyze such large texts, you had better store the term vectors WITH_POSITIONS_OFFSETS at indexing time.

Please read this book.

By doing that, you won't need to re-analyze all the text at runtime: you can pick a method that reuses the existing term vectors, which will reduce the highlighting time.
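The reuse described above can be sketched as follows (a sketch against the Lucene 4.6 API; `reader`, `query`, and `docId` are assumed to come from the search step shown in the question). Since the question's FieldType already stores term vectors with positions and offsets, FastVectorHighlighter reads those instead of re-tokenizing the stored text; requesting only a handful of fragments (rather than 10000) also matters for frequent terms. The snippet requires the Lucene 4.6 jars, so no standalone assertions are possible.

```java
import org.apache.lucene.search.vectorhighlight.FastVectorHighlighter;
import org.apache.lucene.search.vectorhighlight.FieldQuery;

// Sketch: highlight from the stored term vectors of document `docId`.
FastVectorHighlighter fvh = new FastVectorHighlighter();
FieldQuery fieldQuery = fvh.getFieldQuery(query, reader); // reader-aware query analysis
String[] fragments = fvh.getBestFragments(
        fieldQuery, reader, docId,
        "content",  // field indexed with positions, offsets, and term vectors
        100,        // fragment size in characters (contextSize in the question)
        3);         // ask for a few best fragments, not thousands
```

One design note: the highlighter can only skip re-analysis if the term vectors were written at indexing time; enabling them afterwards requires re-indexing the affected documents.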
