Does HTMLStripCharFilterFactory @ Solr 3.4 strip out HTML for returned fields?

Question

I'm using CF10 which should be using Solr 3.4 according to corporatezen.com/2013/11/updating-solr-engine-coldfusion. I added <charFilter class="solr.HTMLStripCharFilterFactory"/> to <fieldType name="text"> but the summary field in the search result still includes HTML. Any idea why?

<field name="summary" type="text" indexed="false" stored="true" required="false"/>

http://localhost:8985/solr/test/admin/schema.jsp shows:


Field: summary Field Type: TEXT

Properties: Tokenized, Stored

Schema: Tokenized, Stored

Position Increment Gap: 100

Index Analyzer: org.apache.solr.analysis.TokenizerChain DETAILS

Char Filters:

org.apache.solr.analysis.HTMLStripCharFilterFactory args:{luceneMatchVersion: LUCENE_24}

Tokenizer Class: org.apache.solr.analysis.WhitespaceTokenizerFactory

Filters:

org.apache.solr.analysis.StopFilterFactory args:{words: stopwords.txt ignoreCase: true enablePositionIncrements: true luceneMatchVersion: LUCENE_24}
org.apache.solr.analysis.WordDelimiterFilterFactory args:{splitOnCaseChange: 1 generateNumberParts: 1 catenateWords: 1 luceneMatchVersion: LUCENE_24 generateWordParts: 1 catenateAll: 0 catenateNumbers: 1}
org.apache.solr.analysis.LowerCaseFilterFactory args:{luceneMatchVersion: LUCENE_24}
org.apache.solr.analysis.EnglishPorterFilterFactory args:{protected: protwords.txt luceneMatchVersion: LUCENE_24}
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory args:{luceneMatchVersion: LUCENE_24}

Query Analyzer: org.apache.solr.analysis.TokenizerChain DETAILS

Char Filters:

org.apache.solr.analysis.HTMLStripCharFilterFactory args:{luceneMatchVersion: LUCENE_24}

Tokenizer Class: org.apache.solr.analysis.WhitespaceTokenizerFactory

Filters:

org.apache.solr.analysis.SynonymFilterFactory args:{synonyms: synonyms.txt expand: true ignoreCase: true luceneMatchVersion: LUCENE_24}
org.apache.solr.analysis.StopFilterFactory args:{words: stopwords.txt ignoreCase: true luceneMatchVersion: LUCENE_24}
org.apache.solr.analysis.WordDelimiterFilterFactory args:{splitOnCaseChange: 1 generateNumberParts: 1 catenateWords: 0 luceneMatchVersion: LUCENE_24 generateWordParts: 1 catenateAll: 0 catenateNumbers: 0}
org.apache.solr.analysis.LowerCaseFilterFactory args:{luceneMatchVersion: LUCENE_24}
org.apache.solr.analysis.EnglishPorterFilterFactory args:{protected: protwords.txt luceneMatchVersion: LUCENE_24}
org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory args:{luceneMatchVersion: LUCENE_24}


Recommended answer

You need to distinguish between the stored value and the indexed value. The char filter you added to the field type alters the tokens written to Solr's index for searching, but not the stored value returned for display.

Solr keeps two versions of a field*. One is the stored version: the original text, in your case with the HTML included. The other is the indexed version, to which the whole text analysis chain has been applied.

Then when you perform a search, the index is used to look up which documents have created a match. When displaying the result, the stored version is presented to you.

* Of course, this applies only if you set stored="true" and indexed="true".
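The stored/indexed split can be illustrated with a small toy model. This is only a sketch in Python, not Solr code: `ToyIndex`, `analyze`, and `strip_html` are hypothetical names invented for illustration, and the analysis chain is reduced to HTML stripping, whitespace tokenizing, and lowercasing. It shows that analysis affects only the tokens used for matching, while the stored value comes back unchanged.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(text):
    """Return the text with tags removed and whitespace normalized."""
    stripper = TagStripper()
    stripper.feed(text)
    stripper.close()
    return " ".join(" ".join(stripper.chunks).split())


def analyze(text):
    """Toy analysis chain: strip HTML, split on whitespace, lowercase."""
    return strip_html(text).lower().split()


class ToyIndex:
    """Hypothetical stand-in for a field with stored and indexed enabled."""

    def __init__(self):
        self.stored = {}    # doc id -> original value, returned verbatim
        self.postings = {}  # token -> doc ids, used only for matching

    def add(self, doc_id, summary):
        self.stored[doc_id] = summary  # analysis is NOT applied here
        for token in analyze(summary):
            self.postings.setdefault(token, set()).add(doc_id)

    def search(self, query):
        hits = set()
        for token in analyze(query):
            hits |= self.postings.get(token, set())
        return [self.stored[doc_id] for doc_id in sorted(hits)]


index = ToyIndex()
index.add(1, "<p>Hello <b>Solr</b></p>")
print(index.search("solr"))  # the match comes from the analyzed tokens
print(index.search("b"))     # tag names were never indexed
```

In this model, a search for "solr" matches the document but returns the summary with its markup intact. If plain text is wanted in the results, one possible approach is to strip the markup before the document is sent for indexing, for example `index.add(2, strip_html("<i>Clean</i> summary"))`, so that the stored value itself contains no HTML.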
