How to tokenize only certain words in Lucene


Question

I'm using Lucene for my project and I need a custom Analyzer.

The code is:

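import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class MyCommentAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents( String fieldName, Reader reader ) {

        // Split the input into tokens, then normalize them.
        Tokenizer source = new StandardTokenizer( Version.LUCENE_48, reader );
        TokenStream filter = new StandardFilter( Version.LUCENE_48, source );

        // Drops the default English stopwords; this is the step to replace.
        filter = new StopFilter( Version.LUCENE_48, filter, StandardAnalyzer.STOP_WORDS_SET );

        return new TokenStreamComponents( source, filter );
    }

}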

I've built it, but now I'm stuck. I need the filter to select only certain words. It's the opposite of using stopwords: instead of removing the words in a wordlist, keep only the terms that appear in the wordlist, like a prebuilt dictionary. So StopFilter doesn't do the job, and none of the other filters Lucene provides seems to fit. I think I need to write my own filter, but I don't know how.

Any suggestions?

Answer

You're right to look to StopFilter for a starting point, so read the source!

Most of StopFilter's source consists of convenience methods for building the stop set. You can safely ignore all of that (unless you want to keep it around for building your keep set).

Cut all that, and StopFilter boils down to:

public final class StopFilter extends FilteringTokenFilter {

    private final CharArraySet stopWords;
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public StopFilter(Version matchVersion, TokenStream in, CharArraySet stopWords) {
        super(matchVersion, in);
        this.stopWords = stopWords;
    }

    @Override
    protected boolean accept() {
        return !stopWords.contains(termAtt.buffer(), 0, termAtt.length());
    }
}

FilteringTokenFilter is a pretty simple class to extend. The key is just the accept method. It's called for each token in the stream: if it returns true, the current term is passed on to the output stream; if it returns false, the current term is discarded.

So the only thing you really need to change in StopFilter is to delete a single character, the ! in accept, so that accept returns the opposite of what it currently does. It wouldn't hurt to change a few names here and there, as well.

public final class KeepOnlyFilter extends FilteringTokenFilter {

    private final CharArraySet keepWords;
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public KeepOnlyFilter(Version matchVersion, TokenStream in, CharArraySet keepWords) {
        super(matchVersion, in);
        this.keepWords = keepWords;
    }

    @Override
    protected boolean accept() {
        return keepWords.contains(termAtt.buffer(), 0, termAtt.length());
    }
}
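
To see the accept contract in isolation, here is a minimal plain-Java sketch that needs no Lucene on the classpath. It is only a model of the behavior described above, not Lucene's actual FilteringTokenFilter: the names TokenFilterModel, AcceptDemo, filter and keepListed are hypothetical, and tokens are plain strings rather than attributes on a TokenStream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a filtering token filter; NOT a Lucene class.
abstract class TokenFilterModel {
    private final Iterator<String> input;
    protected String term; // the current token, visible to accept()

    TokenFilterModel(Iterator<String> input) {
        this.input = input;
    }

    // Return true to pass the current term downstream, false to drop it.
    protected abstract boolean accept();

    // Advance to the next accepted term; false once the input is exhausted.
    final boolean incrementToken() {
        while (input.hasNext()) {
            term = input.next();
            if (accept()) {
                return true;
            }
        }
        return false;
    }
}

class AcceptDemo {
    // keepListed == false models a stop filter (drop listed words);
    // keepListed == true models a keep-only filter (drop everything else).
    static List<String> filter(List<String> tokens, Set<String> words, boolean keepListed) {
        TokenFilterModel f = new TokenFilterModel(tokens.iterator()) {
            @Override
            protected boolean accept() {
                return words.contains(term) == keepListed;
            }
        };
        List<String> out = new ArrayList<>();
        while (f.incrementToken()) {
            out.add(f.term);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("the", "quick", "brown", "fox");
        Set<String> words = new HashSet<>(Arrays.asList("quick", "fox"));
        System.out.println(filter(tokens, words, false)); // prints [the, brown]
        System.out.println(filter(tokens, words, true));  // prints [quick, fox]
    }
}
```

The two modes differ only in the boolean returned by accept, which mirrors the one-character difference between StopFilter and KeepOnlyFilter.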
