Lucene Indexing to ignore apostrophes
Question
I have a field that might contain apostrophes. I want to be able to:
1. store the value as is in the index;
2. search on the value while ignoring any apostrophes.
I am thinking of using:
doc.add(new Field("name", value, Store.YES, Index.NO));
doc.add(new Field("name", value.replaceAll("['‘’`]",""), Store.NO, Index.ANALYZED));
If I then apply the same replacement to the query text when searching, I guess it should work: the cleaned value is used for indexing and searching, and the original value is used for display.
Am I missing any other considerations here?
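The "same replacement on both sides" idea above can be sketched in plain Java, without any Lucene dependency. The class name ApostropheNormalizer and the method strip are hypothetical names chosen for illustration; the character class is the one from the question. The point of the sketch is only that the indexing path and the query path should share one helper, so the two cannot drift apart.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of Lucene): one shared place for the
// apostrophe-stripping rule used by both indexing and searching.
class ApostropheNormalizer {

    // Same character class as in the question: ASCII apostrophe,
    // left/right single quotation marks (U+2018, U+2019) and backtick.
    private static final Pattern APOSTROPHES =
            Pattern.compile("['\u2018\u2019`]");

    // Returns the input with every apostrophe-like character removed.
    static String strip(String value) {
        return APOSTROPHES.matcher(value).replaceAll("");
    }

    public static void main(String[] args) {
        String stored = "O\u2019Brien";   // value kept as is for display
        String indexed = strip(stored);   // value handed to the indexed field
        String query = strip("O'Brien");  // user input, cleaned the same way

        // Both sides were normalized by the same rule, so they compare equal.
        System.out.println(indexed.equals(query));
    }
}
```

Because both call sites go through strip, a later change to the character class automatically applies to indexing and searching alike.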
Recommended answer
Performing replaceAll directly on the value is bad practice in Lucene; it is much better to encapsulate your tokenization recipe in an Analyzer. Also, I don't see the benefit of adding the field twice in your use case (see Document.add).
If you want to store the original value and still be able to search without the apostrophes, simply declare your field like this:
doc.add(new Field("name", value, Store.YES, Index.ANALYZED));
Then simply hook up a custom Tokenizer that removes apostrophes (I think Lucene's StandardAnalyzer already includes this transformation). Use the same Analyzer when parsing queries, so that the indexed terms and the query terms are normalized identically.
If you are storing the field with the aim of using highlighting, you should also consider using Field.TermVector.WITH_POSITIONS_OFFSETS.