Very slow highlight performance in Lucene
Question
The Lucene (4.6) highlighter is very slow when a frequent term is searched. The search itself is fast (100 ms), but highlighting can take more than an hour(!).
Details: a large text corpus was used (1.5 GB plain text). Performance does not depend on whether the text is split into smaller pieces or not (tested with 500 MB and 5 MB pieces as well). Positions and offsets are stored. If a very frequent term or pattern is searched, TopDocs are retrieved quickly (100 ms), but each "searcher.doc(id)" call is expensive (5-50 s), and getBestFragments() is extremely expensive (more than 1 hour), even though positions and offsets are stored and indexed for exactly this purpose. (Hardware: Core i7, 8 GB RAM.)
Greater background: this would serve language analysis research. A special stemmer is used that also stores part-of-speech information. For example, if "adj adj adj adj noun" is searched, it returns all occurrences of that pattern in the text, with context.
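One way such a part-of-speech pattern could be expressed in Lucene is an ordered SpanNearQuery. This is only a sketch under the assumption that POS tags are indexed as tokens in the "content" field (the tag tokens "adj" and "noun" here are hypothetical):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

// Hypothetical sketch: if each token's POS tag is indexed as a token
// in the "content" field, the pattern "adj adj adj adj noun" can be
// expressed as an ordered SpanNearQuery with no gaps between clauses.
SpanQuery[] clauses = new SpanQuery[] {
    new SpanTermQuery(new Term("content", "adj")),
    new SpanTermQuery(new Term("content", "adj")),
    new SpanTermQuery(new Term("content", "adj")),
    new SpanTermQuery(new Term("content", "adj")),
    new SpanTermQuery(new Term("content", "noun"))
};
// slop = 0 and inOrder = true: the five tags must appear consecutively
SpanNearQuery pattern = new SpanNearQuery(clauses, 0, true);
```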
Can I tune its performance, or should I choose another tool?
Code used:
//indexing
FieldType offsetsType = new FieldType(TextField.TYPE_STORED);
offsetsType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
offsetsType.setStored(true);
offsetsType.setIndexed(true);
offsetsType.setStoreTermVectors(true);
offsetsType.setStoreTermVectorOffsets(true);
offsetsType.setStoreTermVectorPositions(true);
offsetsType.setStoreTermVectorPayloads(true);
doc.add(new Field("content", fileContent, offsetsType));
//querying
TopDocs results = searcher.search(query, limitStart + limit);
int endPos = Math.min(results.scoreDocs.length, limitStart + limit);
int startPos = Math.min(results.scoreDocs.length, limitStart);
FastVectorHighlighter h = new FastVectorHighlighter(); // can be reused across hits
for (int i = startPos; i < endPos; i++) {
    int id = results.scoreDocs[i].doc;
    // bottleneck #1 (5-50s):
    Document doc = searcher.doc(id);
    // bottleneck #2 (more than 1 hour); m is presumably the IndexReader:
    String[] hs = h.getBestFragments(h.getFieldQuery(query), m, id, "content", contextSize, 10000);
}
Related (unanswered) question: https://stackoverflow.com/questions/19416804/very-slow-solr-performance-when-highlighting
Answer
BestFragments relies on the tokenization done by the analyzer you are using. If you have to analyze such a big text, you are better off storing the term vector WITH_POSITIONS_OFFSETS at indexing time.
By doing that, you won't need to re-analyze all the text at runtime, because the highlighter can pick a method that reuses the existing term vectors, and this will reduce the highlighting time.
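Since the question's field is already indexed with DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS, another option worth trying is Lucene 4.6's PostingsHighlighter, which reads offsets directly from the postings instead of re-analyzing the stored text. A minimal sketch, reusing the searcher, query, and "content" field from the question:

```java
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.postingshighlight.PostingsHighlighter;

// Sketch: PostingsHighlighter uses the offsets already stored in the
// postings, so it avoids re-tokenizing the large stored field at
// query time. Requires the field to be indexed with
// DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS, as in the question.
PostingsHighlighter highlighter = new PostingsHighlighter();
TopDocs topDocs = searcher.search(query, limitStart + limit);
// returns one highlighted snippet per hit for the "content" field
String[] snippets = highlighter.highlight("content", query, searcher, topDocs);
```

Whether this beats FastVectorHighlighter here depends on the data, but it avoids both loading the huge stored field per hit and walking very large term vectors.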