Lucene Highlighter sometimes inexplicably returns blank fragments

Question

I've been working on a Lucene document search program for the last few days and everything had been going well, until now. I'm trying to use the Lucene.Net.Highlight.Highlighter class to show relevant snippets for my search results, but it isn't working consistently. Most of the time, calling Highlighter.GetBestFragments() does exactly what I'd expect (it shows relevant text snippets containing the given query string), but sometimes it just returns an empty string.

I've triple checked my inputs and I can verify that the query string I'm using exists in the input text, but the highlighter just arbitrarily returns an empty string sometimes. The problem is reproducible; documents that have blank fragments returned will continue to have blank fragments returned when using the same query, while documents that have legitimate fragments continue to have legitimate fragments.

However, the problem is NOT document-specific. Some queries return valid fragments for a document where other queries return an empty string for the same document. The problem also does not appear to be related to my analyzer; it shows up whether I use a StandardAnalyzer or a SnowballAnalyzer.

After many hours of poking around I have been unable to find any pattern in the queries/documents that fail versus those that work. Keep in mind that this is happening on documents that were specifically pulled back from the Lucene index using the exact same query. That means the Searcher is able to find the relevant query string in the target document but the Highlighter is not.

Is this a bug in Lucene? If so, how can I work around it?

My code:

private static SimpleHTMLFormatter _formatter = new SimpleHTMLFormatter("<b>", "</b>");
private static SimpleFragmenter _fragmenter = new SimpleFragmenter(50);
...
{
    using (var searcher = new IndexSearcher(analyzerInfo.Directory, false))
    {
        QueryParser parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_29, "Text", analyzerInfo.Analyzer);
        parser.SetMultiTermRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);

        //build query
        BooleanQuery booleanQuery = new BooleanQuery();
        booleanQuery.Add(new TermQuery(new Term("PageNum", "0")), BooleanClause.Occur.MUST);
        booleanQuery.Add(parser.Parse(searchQuery), BooleanClause.Occur.MUST);
        Query query = booleanQuery.Rewrite(searcher.GetIndexReader());

        //get results from query
        ScoreDoc[] hits = searcher.Search(query, 50).ScoreDocs;
        List<DVDoc> results = hits.Select(hit => MapLuceneDocumentToData(searcher.Doc(hit.Doc))).ToList();

        //add relevant fragments to search results (shows WHY a certain result was chosen)
        QueryScorer scorer = new QueryScorer(query);
        Highlighter highlighter = new Highlighter(_formatter, scorer);
        highlighter.SetTextFragmenter(_fragmenter);
        foreach (DVDoc result in results)
        {
            TokenStream stream = analyzerInfo.Analyzer.TokenStream("Text", new StringReader(result.Text));
            result.RelevantFragments = highlighter.GetBestFragments(stream, result.Text, 3, "...");
        }

        //clean up
        analyzerInfo.Analyzer.Close();
        searcher.Close();

        return results;
    }
}



(Note: DVDoc is essentially just a struct which stores info about documents that were found. The method MapLuceneDocumentToData turns a Lucene Document into my custom DVDoc class, no magic there.)

And since everyone likes example inputs and outputs:

  • Example of GetBestFragments working
  • Example of GetBestFragments NOT working

I'm using Lucene.NET Version 2.9.4g.

Recommended answer

By default the Highlighter will only process the first 51200 chars of a Document.

To increase this limit, set the MaxDocCharsToAnalyze property.

http://lucene.apache.org/core/old_versioned_docs/versions/2_9_2/api/contrib-highlighter/org/apache/lucene/search/highlight/Highlighter.html#setMaxDocCharsToAnalyze(INT
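
This would explain the symptoms in the question: the searcher matches a term anywhere in the indexed field, but the highlighter stops analyzing once it reaches the limit, so a document whose only matches lie beyond that point yields no fragments. A minimal sketch of the change, reusing the variable names from the question's code, might look like the following. It assumes your Lucene.NET build exposes the limit as a property, as described above; the linked Java API names a setMaxDocCharsToAnalyze(int) method, so some builds may expose a method form instead (an assumption worth checking against your assembly).

```csharp
using Lucene.Net.Highlight;

// Sketch only: query, _formatter and _fragmenter are the objects from the question's code.
QueryScorer scorer = new QueryScorer(query);
Highlighter highlighter = new Highlighter(_formatter, scorer);
highlighter.SetTextFragmenter(_fragmenter);

// Raise the analysis limit so matches past the default cutoff are considered.
// Property form as named in the answer; if it does not compile against your build,
// look for a Java-style setter method instead (assumption, not verified for 2.9.4g).
highlighter.MaxDocCharsToAnalyze = int.MaxValue;
```

Analyzing whole documents costs more time on very large texts, so a smaller limit (for example, the length of the longest text you expect to highlight) may be a better trade-off than int.MaxValue.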
