How to handle HTML tags in highlight fragments in SOLR


Question

I use SOLR's hit highlighting feature to highlight the parts of a document that match the query.

The problem is that one of the fields contains valid HTML, but the highlight fragment returned is not valid HTML, so the whole page layout is broken after rendering.

For example, the query field:lucene returns this document:

<p><a href="/some/link">Here is the discussion, what the difference between SOLR, Elasticsearch and Lucene</a></p>

The highlighted fragment is Elasticsearch and <em>Lucene</em></a></p>.

One option I've tried is to set the fragment size to 0 (which returns the whole field content), but the content can be very large and I need just a few snippets for the result page.

Another option is to remove all HTML tags and show the snippet as plain text, but I need the <em> tags for highlighting. Also, some tags can be cut off in a fragment, like </p, which means we can't use HTML parsers for that purpose.
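To illustrate the "strip everything but <em>" option, here is a minimal Python sketch that works on the fragment with a regular expression rather than an HTML parser, so cut-off tags are tolerated. The function name keep_only_em is hypothetical, and the approach assumes that literal < and > characters in the text are escaped as entities; an unescaped "a < b" at the end of a fragment would be mistaken for a cut-off tag. Any <em> tags already present in the source HTML are kept as well.

```python
import re

# Matches, in order of preference:
#   a complete tag                       e.g. "</a>"
#   a tag cut off at the fragment end    e.g. "</p"
#   the tail of a tag cut off at the fragment start, e.g. 'link">'
_TAG = re.compile(r"<[^<>]*>|<[^<>]*$|^[^<>]*>")

# The only markup allowed to survive: the highlighter's own markers.
_KEEP = {"<em>", "</em>"}


def keep_only_em(fragment: str) -> str:
    """Remove every tag or partial tag from a fragment except <em> and </em>."""
    return _TAG.sub(
        lambda m: m.group(0) if m.group(0) in _KEEP else "",
        fragment,
    )


if __name__ == "__main__":
    print(keep_only_em("Elasticsearch and <em>Lucene</em></a></p>"))
    print(keep_only_em("Elasticsearch and <em>Lucene</em></a></p"))
```

This only cleans up fragments after the fact; it does not make the fragment boundaries any smarter.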

This seems like a common problem in search. Is there a state-of-the-art approach to handling it?

Recommended answer

The usual solution is to strip the HTML on the way in (for example using the HTMLStripCharFilter), before indexing. That way you'll have a plain-text field that you can do highlighting on, and you can display the result with the embedded <em> tags.

You can then use copyField to keep one field with the HTML representation intact and one without the HTML (to use for highlighting).
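As a rough sketch of the two-field idea, the stripping can also be done on the client before the document is sent to Solr. This swaps in Python's standard html.parser for the Solr-side char filter and copyField named in the answer, so it is an alternative illustration, not the answer's exact setup. The field names body_html and body_text are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML document."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html: str) -> str:
    """Return the text content of `html` with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with a space keeps words in adjacent block elements apart;
    # the trade-off is a space wherever an inline tag splits a word.
    return " ".join(" ".join(parser.parts).split())


def build_solr_doc(doc_id: str, html: str) -> dict:
    """Build a document with the HTML intact plus a plain-text copy.

    `body_html` and `body_text` are hypothetical field names; highlighting
    would be requested on the plain-text field only.
    """
    return {"id": doc_id, "body_html": html, "body_text": strip_html(html)}


if __name__ == "__main__":
    sample = (
        '<p><a href="/some/link">Here is the discussion, what the difference '
        "between SOLR, Elasticsearch and Lucene</a></p>"
    )
    print(build_solr_doc("1", sample)["body_text"])
```

With the plain-text field stored this way, highlight fragments taken from it contain no markup other than the highlighter's own <em> tags.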
