StandardAnalyzer with stemming

Question

Is there a way to integrate PorterStemFilter into Lucene's StandardAnalyzer, or do I have to copy and paste StandardAnalyzer's source code and add the filter myself, since StandardAnalyzer is declared as a final class? Is there a smarter way?

Also, if I want numbers to be ignored, how can I achieve that?

Thanks.

Recommended answer

If you want this combination for English text analysis, you should use Lucene's EnglishAnalyzer. Otherwise, you can create a new Analyzer that extends AnalyzerWrapper, as shown below.

import java.io.IOException;
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.AnalyzerWrapper;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.TypeTokenFilter;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;


public class PorterAnalyzer extends AnalyzerWrapper {

  private Analyzer baseAnalyzer;

  public PorterAnalyzer(Analyzer baseAnalyzer) {
      this.baseAnalyzer = baseAnalyzer;
  }

  @Override
  public void close() {
      baseAnalyzer.close();
      super.close();
  }

  @Override
  protected Analyzer getWrappedAnalyzer(String fieldName)
  {
      return baseAnalyzer;
  }

  @Override
  protected TokenStreamComponents wrapComponents(String fieldName, TokenStreamComponents components)
  {
      TokenStream ts = components.getTokenStream();
      Set<String> filteredTypes = new HashSet<>();
      filteredTypes.add("<NUM>");
      TypeTokenFilter numberFilter = new TypeTokenFilter(Version.LUCENE_46,ts, filteredTypes);

      PorterStemFilter porterStem = new PorterStemFilter(numberFilter);
      return new TokenStreamComponents(components.getTokenizer(), porterStem);
  }

  public static void main(String[] args) throws IOException
  {

      //Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
      PorterAnalyzer analyzer = new PorterAnalyzer(new StandardAnalyzer(Version.LUCENE_46));
      String text = "This is a testing example. It should tests the Porter stemmer version 111";

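      TokenStream ts = analyzer.tokenStream("fieldName", new StringReader(text));
      // Fetch the term attribute once; incrementToken() updates it in place.
      CharTermAttribute ca = ts.addAttribute(CharTermAttribute.class);
      ts.reset();

      while (ts.incrementToken()){
          System.out.println(ca.toString());
      }
      ts.end();    // complete end-of-stream processing
      ts.close();  // release the stream so the analyzer can reuse it
      analyzer.close();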
  }

}

The code above is based on this Lucene forum thread. The main work is done in the wrapComponents method. You first get the TokenStream object from the wrapped analyzer, then apply a type filter to drop numeric tokens (those the tokenizer tagged with the "<NUM>" type). Lastly, you apply the Porter stem filter to what remains. I hope that is clear.
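
To make the order of the steps in wrapComponents easier to follow, here is a minimal, dependency-free sketch of the same chain: tokenize, drop numeric tokens, then stem. It does not use Lucene at all. The class FilterChainSketch and its methods are hypothetical names made up for illustration, the number check only approximates what TypeTokenFilter does with the "<NUM>" type, the stemmer is a toy suffix stripper and not the Porter algorithm, and stop-word removal (which StandardAnalyzer may also perform) is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration only: mimics the filter order of wrapComponents without Lucene.
class FilterChainSketch {

    // Stand-in for the tokenizer: lowercases and splits on non-alphanumerics.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Stand-in for the type filter: drops tokens made up only of digits.
    static List<String> dropNumbers(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!t.matches("[0-9]+")) {
                kept.add(t);
            }
        }
        return kept;
    }

    // Toy stand-in for the stem filter: strips "ing" or a plural "s".
    // This is NOT the Porter algorithm.
    static String toyStem(String token) {
        if (token.length() > 5 && token.endsWith("ing")) {
            return token.substring(0, token.length() - 3);
        }
        if (token.length() > 3 && token.endsWith("s") && !token.endsWith("ss")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    // Applies the steps in the same order as wrapComponents:
    // numbers are removed first, then the surviving tokens are stemmed.
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String t : dropNumbers(tokenize(text))) {
            out.add(toyStem(t));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("It should tests the Porter stemmer version 111"));
    }
}
```

In the real analyzer each stage wraps the previous TokenStream instead of building intermediate lists, but the order is the same: the stemmer only ever sees tokens that survived the number filter.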
