Very slow highlight performance in Lucene
Question
The Lucene (4.6) highlighter is very slow when a frequent term is searched. The search itself is fast (100 ms), but highlighting can take more than an hour(!).
Details: a large text corpus was used (1.5 GB plain text). Performance does not depend on whether the text is split into smaller pieces or not (tested with 500 MB and 5 MB pieces as well). Positions and offsets are stored. If a very frequent term or pattern is searched, TopDocs are retrieved quickly (100 ms), but each "searcher.doc(id)" call is expensive (5-50 s), and getBestFragments() is extremely expensive (more than 1 hour), even though positions and offsets are stored and indexed for exactly this purpose. (Hardware: Core i7, 8 GB RAM.)
Greater background: this would serve language analysis research. A special stemmer is used that also stores part-of-speech information. For example, if "adj adj adj adj noun" is searched, it returns all occurrences of that pattern in the text, with context.
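One way such a part-of-speech pattern could be expressed in Lucene is an ordered SpanNearQuery. This is only a sketch under the assumption that POS tags are indexed as tokens in the "content" field (the tag tokens "adj" and "noun" here are hypothetical):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

// Hypothetical sketch: if each token's POS tag is indexed as a token
// in the "content" field, the pattern "adj adj adj adj noun" can be
// expressed as an ordered SpanNearQuery with no gaps between clauses.
SpanQuery[] clauses = new SpanQuery[] {
    new SpanTermQuery(new Term("content", "adj")),
    new SpanTermQuery(new Term("content", "adj")),
    new SpanTermQuery(new Term("content", "adj")),
    new SpanTermQuery(new Term("content", "adj")),
    new SpanTermQuery(new Term("content", "noun"))
};
// slop = 0 and inOrder = true: the five tags must appear consecutively
SpanNearQuery pattern = new SpanNearQuery(clauses, 0, true);
```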
Can I tune its performance, or should I choose another tool?
Code used:
//indexing
FieldType offsetsType = new FieldType(TextField.TYPE_STORED);
offsetsType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
offsetsType.setStored(true);
offsetsType.setIndexed(true);
offsetsType.setStoreTermVectors(true);
offsetsType.setStoreTermVectorOffsets(true);
offsetsType.setStoreTermVectorPositions(true);
offsetsType.setStoreTermVectorPayloads(true);
doc.add(new Field("content", fileContent, offsetsType));
//querying
TopDocs results = searcher.search(query, limitStart + limit);
int endPos = Math.min(results.scoreDocs.length, limitStart + limit);
int startPos = Math.min(results.scoreDocs.length, limitStart);
FastVectorHighlighter h = new FastVectorHighlighter(); // can be reused across hits
for (int i = startPos; i < endPos; i++) {
    int id = results.scoreDocs[i].doc;
    // bottleneck #1 (5-50s):
    Document doc = searcher.doc(id);
    // bottleneck #2 (more than 1 hour); m is presumably the IndexReader:
    String[] hs = h.getBestFragments(h.getFieldQuery(query), m, id, "content", contextSize, 10000);
}
Related (unanswered) question: https://stackoverflow.com/questions/19416804/very-slow-solr-performance-when-highlighting
Answer
BestFragments relies on the tokenization done by the analyzer you are using. If you have to analyze such a big text, you are better off storing the term vector WITH_POSITIONS_OFFSETS at indexing time.
By doing that, you won't need to re-analyze all the text at runtime, because the highlighter can pick a method that reuses the existing term vectors, and this will reduce the highlighting time.
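Since the question's field is already indexed with DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS, another option worth trying is Lucene 4.6's PostingsHighlighter, which reads offsets directly from the postings instead of re-analyzing the stored text. A minimal sketch, reusing the searcher, query, and "content" field from the question:

```java
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.postingshighlight.PostingsHighlighter;

// Sketch: PostingsHighlighter uses the offsets already stored in the
// postings, so it avoids re-tokenizing the large stored field at
// query time. Requires the field to be indexed with
// DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS, as in the question.
PostingsHighlighter highlighter = new PostingsHighlighter();
TopDocs topDocs = searcher.search(query, limitStart + limit);
// returns one highlighted snippet per hit for the "content" field
String[] snippets = highlighter.highlight("content", query, searcher, topDocs);
```

Whether this beats FastVectorHighlighter here depends on the data, but it avoids both loading the huge stored field per hit and walking very large term vectors.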