How to handle HTML tags in highlight fragments in Solr
Question
I use Solr's hit highlighting feature to highlight the parts of a document that match the query.
The problem is that one of the fields contains valid HTML, but the highlight fragment returned for it is not valid HTML, so the whole page layout breaks once the results are rendered.
For example, the query field:lucene
returns this document:
<p><a href="/some/link">Here is the discussion, what the difference between SOLR, Elasticsearch and Lucene</a></p>
and the highlighted fragment is:
Elasticsearch and <em>Lucene</em></a></p>
One option I tried is setting the fragment size to 0, which returns the whole field content, but that can be very large and I only need a few snippets for the results page.
Another option is to remove all HTML tags and show the snippet as plain text, but I need the <em> tags for highlighting. Also, a tag can be cut off inside a fragment, like </p, which means we can't use an HTML parser for this.
This seems like a common problem in search. Is there a state-of-the-art approach to handling it?
Recommended answer
The usual solution is to strip the HTML on the way in, before indexing (for example with HTMLStripCharFilterFactory). That way you have a plain-text field that you can highlight on, and you can display the result with the embedded <em> tags.
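As a rough illustration of stripping HTML "on the way in", here is a minimal sketch that does it on the client side in Python, before the document is sent to Solr. This is a swapped-in technique, not the Solr char filter itself. The names strip_html and _TextExtractor are hypothetical helpers made up for this example.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Hypothetical helper: keeps text nodes and drops every tag."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        # Join the text nodes and collapse runs of whitespace.
        return " ".join("".join(self._chunks).split())


def strip_html(markup: str) -> str:
    """Return the text content of `markup` with all tags removed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()


doc_html = (
    '<p><a href="/some/link">Here is the discussion, what the difference '
    "between SOLR, Elasticsearch and Lucene</a></p>"
)
plain = strip_html(doc_html)
```

The sketch does not insert spaces between adjacent block elements and does not special-case script or style content, so real input may need extra handling.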
You can then use copyField to keep one field with the HTML representation intact and one without the HTML (to use for highlighting).
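Once highlighting runs on the HTML-free field, a snippet should be plain text plus the highlighter's <em> markers, so it can be escaped before it is placed on the page. The following is a sketch under that assumption. render_snippet is a hypothetical helper, and the default markers are assumed to be <em> and </em>.

```python
import html


def render_snippet(snippet: str, pre: str = "<em>", post: str = "</em>") -> str:
    """Escape a plain-text snippet but keep the highlight markers as real tags."""
    # Escape everything first, so stray <, > and & cannot break the page.
    escaped = html.escape(snippet, quote=False)
    # Then turn only the escaped highlight markers back into tags.
    return (
        escaped
        .replace(html.escape(pre, quote=False), pre)
        .replace(html.escape(post, quote=False), post)
    )


safe = render_snippet("Elasticsearch and <em>Lucene</em> & friends")
```

One limitation: if the indexed text itself contains the literal characters <em>, this sketch cannot tell them apart from the highlighter's markers.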