Lucene indexing to ignore apostrophes


Question

I have a field that might have apostrophes in it. I want to be able to:
1. store the value as is in the index
2. search based on the value, ignoring any apostrophes.

I am thinking of using:

   doc.add(new Field("name", value, Store.YES, Index.NO));
   doc.add(new Field("name", value.replaceAll("['‘’`]",""), Store.NO, Index.ANALYZED));

If I then do the same replacement when searching, I guess it should work: the cleaned value is used for indexing and searching, and the original value is used for display.

Am I missing any other considerations here?

Recommended answer

Performing replaceAll directly on the value is bad practice in Lucene; it is much better to encapsulate your tokenization recipe in an Analyzer. I also don't see the benefit of appending two fields with the same name in your use case (see Document.add).

If you want to store the original value and still be able to search without the apostrophes, simply declare your field like this:

doc.add(new Field("name", value, Store.YES, Index.ANALYZED));

Then simply hook up a custom Tokenizer that will replace apostrophes (I think Lucene's StandardAnalyzer already includes this transformation). Use the same analyzer when parsing queries, so that indexed terms and query terms are normalized identically.
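To make the idea concrete, here is a minimal plain-Java sketch of what such an analysis step achieves. It is not the Lucene API: the class ApostropheFoldingIndex and its analyze, add, and search methods are hypothetical names invented for illustration, and the lowercasing and whitespace splitting are assumptions standing in for whatever tokenizer you actually configure. The point it shows is that one shared recipe runs at both index time and query time, while the stored value stays untouched.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Plain-Java stand-in for the analyzer idea; this is NOT the Lucene API.
class ApostropheFoldingIndex {
    // Original values, returned unchanged for display (the role of Store.YES).
    private final List<String> stored = new ArrayList<>();
    // Normalized token -> ids of documents containing it (the role of Index.ANALYZED).
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // The single analysis recipe: strip apostrophe-like characters
    // (', left/right single quotes, backtick), lowercase, split on whitespace.
    static List<String> analyze(String text) {
        String folded = text.replaceAll("['\u2018\u2019`]", "").toLowerCase(Locale.ROOT);
        List<String> tokens = new ArrayList<>();
        for (String token : folded.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Stores the value as is and indexes its normalized tokens.
    void add(String value) {
        int id = stored.size();
        stored.add(value);
        for (String token : analyze(value)) {
            postings.computeIfAbsent(token, k -> new LinkedHashSet<>()).add(id);
        }
    }

    // Returns the original stored values of documents containing every query token.
    List<String> search(String query) {
        Set<Integer> hits = null;
        for (String token : analyze(query)) {
            Set<Integer> ids = postings.getOrDefault(token, new LinkedHashSet<>());
            if (hits == null) {
                hits = new LinkedHashSet<>(ids);
            } else {
                hits.retainAll(ids);
            }
        }
        List<String> results = new ArrayList<>();
        if (hits != null) {
            for (int id : hits) {
                results.add(stored.get(id));
            }
        }
        return results;
    }

    public static void main(String[] args) {
        ApostropheFoldingIndex index = new ApostropheFoldingIndex();
        index.add("O'Brien's Pub");
        index.add("Plain Name");
        // The query has no apostrophes, yet the original value comes back.
        System.out.println(index.search("obriens pub"));
    }
}
```

Because the caller never touches the raw string, the normalization cannot drift between indexing and searching, which is the main argument for putting it inside an Analyzer rather than calling replaceAll by hand in two places.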

If you are storing the field in order to use highlighting, you should also consider using Field.TermVector.WITH_POSITIONS_OFFSETS.
