Using CharFilter with Lucene 4.3.0's StandardAnalyzer
Question
I am trying to add a CharFilter to my StandardAnalyzer. My intention is to strip out punctuation from all the text I index; for example, I want a PrefixQuery "pf" to match "P.F. Chang's", or "zaras" to match "Zara's".
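To make the goal concrete, here is a minimal sketch in plain Java (no Lucene involved) of the normalization being asked for. The class and method names are hypothetical, and the check compares whole strings rather than individual indexed terms, so it only approximates what a PrefixQuery does against a real index.

```java
import java.util.Locale;

// Hypothetical helper illustrating the desired matching behaviour only.
final class PunctuationGoal {

    // Remove ASCII punctuation, then lowercase, roughly what the
    // analyzer should do to text before it is indexed.
    static String normalize(String text) {
        return text.replaceAll("\\p{Punct}", "").toLowerCase(Locale.ROOT);
    }

    // True if the normalized indexed text starts with the normalized prefix.
    static boolean prefixMatches(String prefix, String indexedText) {
        return normalize(indexedText).startsWith(normalize(prefix));
    }
}
```

Under these assumptions, `PunctuationGoal.prefixMatches("pf", "P.F. Chang's")` and `PunctuationGoal.prefixMatches("zaras", "Zara's")` both hold, which is the behaviour the question is after.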
It seems that the easiest plan of attack here is to filter out all punctuation before analysis. Per the Analyzer package documentation, that means I should use a CharFilter.
However, it seems next to impossible to actually insert a CharFilter into the analyzer!
The JavaDoc for Analyzer.initReader says "Override this if you want to insert a CharFilter".
If my code extends Analyzer, I can override initReader, but I cannot delegate the abstract createComponents to my base StandardAnalyzer, because it is protected. I cannot override tokenStream to delegate to my base analyzer either, because it is final. So a subclass of Analyzer seemingly cannot use another Analyzer to do its dirty work.
There is an AnalyzerWrapper class that seems perfect for what I want! I can provide a base analyzer and override only the pieces that I want. Except that initReader is already overridden to delegate to the base analyzer, and this override is final! Bummer!
I guess I could put my Analyzer in the org.apache.lucene.analyzers package so that I can access the protected createComponents method, but this seems like a disgustingly hacky way to bypass the public API that I really should be using.
Am I missing something glaring here? How can I amend a StandardAnalyzer to use a custom CharFilter?
Recommended answer
The intent is for you to extend Analyzer, rather than StandardAnalyzer. The thinking is that you should never subclass an Analyzer implementation (there has been some discussion of this). Analyzer implementations are pretty straightforward though, and adding a CharFilter to an Analyzer that implements the same tokenizer/filter chain as StandardAnalyzer would look something like:
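import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public final class MyAnalyzer extends Analyzer {
    private final Version matchVersion = Version.LUCENE_43;

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // Same tokenizer/filter chain as StandardAnalyzer.
        final StandardTokenizer src = new StandardTokenizer(matchVersion, reader);
        TokenStream tok = new StandardFilter(matchVersion, src);
        tok = new LowerCaseFilter(matchVersion, tok);
        tok = new StopFilter(matchVersion, tok, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        return new TokenStreamComponents(src, tok);
    }
    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        // Return your CharFilter-wrapped reader here; returning it unchanged filters nothing.
        return reader;
    }
}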
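As an illustration of the kind of character-level wrapper that initReader could return, here is a stdlib-only sketch of a Reader that drops ASCII punctuation. It is not a Lucene CharFilter: the class name is hypothetical, it extends java.io.FilterReader so that it runs without Lucene on the classpath, and it does no offset correction, which a real CharFilter subclass is expected to handle. Treat it as a picture of the transformation, not as the answer's implementation.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a punctuation-stripping CharFilter.
final class PunctuationStrippingReader extends FilterReader {

    PunctuationStrippingReader(Reader in) {
        super(in);
    }

    // ASCII punctuation only: ! through /, : through @, [ through `, { through ~.
    private static boolean isPunctuation(int c) {
        return (c >= 33 && c <= 47) || (c >= 58 && c <= 64)
            || (c >= 91 && c <= 96) || (c >= 123 && c <= 126);
    }

    @Override
    public int read() throws IOException {
        int c;
        // Skip punctuation; pass every other character through unchanged.
        while ((c = in.read()) != -1) {
            if (!isPunctuation(c)) {
                return c;
            }
        }
        return -1;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = 0;
        while (n < len) {
            int c = read();
            if (c == -1) {
                break;
            }
            buf[off + n++] = (char) c;
        }
        // Signal end of stream only when nothing at all could be read.
        return (n == 0 && len > 0) ? -1 : n;
    }

    // Convenience helper for trying the reader on a string.
    static String strip(String text) {
        StringBuilder out = new StringBuilder();
        try (Reader r = new PunctuationStrippingReader(new StringReader(text))) {
            int c;
            while ((c = r.read()) != -1) {
                out.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

With this sketch, `PunctuationStrippingReader.strip("P.F. Chang's")` yields `PF Changs`, which the tokenizer and LowerCaseFilter in the chain above would then turn into lowercase terms. In a real analyzer it is probably simpler to wrap the reader in one of the CharFilter implementations that ship with Lucene's analyzers-common module, such as PatternReplaceCharFilter or MappingCharFilter, than to write one by hand.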