Using CharFilter with Lucene 4.3.0's StandardAnalyzer

Question

I am trying to add a CharFilter to my StandardAnalyzer. My intention is to strip out punctuation from all the text I index; for example I want a PrefixQuery "pf" to match "P.F. Chang's" or "zaras" to match "Zara's".

It seems that the easiest plan of attack here is to filter out all punctuation before analysis. Per the Analyzer package documentation, that means I should use a CharFilter.

However, it seems next to impossible to actually insert a CharFilter into the analyzer!

The JavaDoc for Analyzer.initReader says "Override this if you want to insert a CharFilter".

If my code extends Analyzer, I can override initReader, but I cannot delegate the abstract createComponents to my base StandardAnalyzer, because it is protected. I cannot delegate tokenStream to my base analyzer either, because it is final. So a subclass of Analyzer seemingly cannot use another Analyzer to do its dirty work.

There is an AnalyzerWrapper class that seems perfect for what I want! I can provide a base analyzer and only override the pieces that I want. Except … initReader is overridden already to delegate to the base analyzer, and this override is "final"! Bummer!

I guess I could put my Analyzer in the org.apache.lucene.analysis package so that it can access the protected createComponents method, but this seems like a disgustingly hacky way to bypass the public API that I really should be using.

Am I missing something glaring here? How can I amend a StandardAnalyzer to use a custom CharFilter?

Recommended answer

The intent is for you to extend Analyzer directly, rather than StandardAnalyzer. The thinking is that you should never subclass an Analyzer implementation (the original answer links to some discussion of this). Analyzer implementations are pretty straightforward though, and adding a CharFilter to an Analyzer that implements the same tokenizer/filter chain as StandardAnalyzer would look something like:

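import java.io.Reader;
import java.util.regex.Pattern;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.pattern.PatternReplaceCharFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public final class MyAnalyzer extends Analyzer {
    private final Version matchVersion = Version.LUCENE_43;

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // Same tokenizer/filter chain that StandardAnalyzer builds.
        final StandardTokenizer src = new StandardTokenizer(matchVersion, reader);
        TokenStream tok = new StandardFilter(matchVersion, src);
        tok = new LowerCaseFilter(matchVersion, tok);
        tok = new StopFilter(matchVersion, tok, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        return new TokenStreamComponents(src, tok);
    }

    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        // Return your CharFilter-wrapped reader here. As one possible example
        // (not part of the original answer), this strips ASCII punctuation.
        return new PatternReplaceCharFilter(Pattern.compile("\\p{Punct}"), "", reader);
    }
}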
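
To get a feel for what a punctuation-stripping CharFilter would do to the text before tokenization, here is a minimal plain-Java sketch that needs no Lucene on the classpath. It only imitates the character-level rewrite plus lowercasing; it is not how Lucene runs the analysis chain. The class name PunctuationStripDemo, its two helper methods, and the choice of the \p{Punct} pattern (ASCII punctuation only) are all assumptions made for illustration.

```java
import java.util.Locale;
import java.util.regex.Pattern;

public final class PunctuationStripDemo {
    // Hypothetical choice of pattern: every ASCII punctuation character.
    private static final Pattern PUNCTUATION = Pattern.compile("\\p{Punct}");

    // Imitates "strip punctuation, then lowercase" on a raw string.
    static String normalize(String text) {
        return PUNCTUATION.matcher(text).replaceAll("").toLowerCase(Locale.ROOT);
    }

    // Rough stand-in for a prefix match against whitespace-separated tokens.
    static boolean anyTokenStartsWith(String text, String prefix) {
        for (String token : normalize(text).split("\\s+")) {
            if (token.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(normalize("P.F. Chang's"));             // prints "pf changs"
        System.out.println(anyTokenStartsWith("Zara's", "zaras")); // prints "true"
    }
}
```

Because the punctuation is removed before tokenization, "P.F." collapses into the single token "pf", which is what lets a prefix such as "pf" match. Non-ASCII punctuation such as typographic apostrophes would need a broader pattern.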
