How to tokenize only certain words in Lucene

Question

I'm using Lucene for my project and I need a custom Analyzer.

The code is:

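import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class MyCommentAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents( String fieldName, Reader reader ) {
        // Tokenize the input, then apply the standard normalization filter.
        Tokenizer source = new StandardTokenizer( Version.LUCENE_48, reader );
        TokenStream filter = new StandardFilter( Version.LUCENE_48, source );

        // Drop the default English stop words.
        filter = new StopFilter( Version.LUCENE_48, filter, StandardAnalyzer.STOP_WORDS_SET );

        return new TokenStreamComponents( source, filter );
    }
}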

I've built it, but now I'm stuck. My requirement is that the filter must keep only certain words. It is the opposite of using stop words: instead of removing the terms in a word list, keep only the terms in the word list, like a prebuilt dictionary. So StopFilter doesn't meet the goal, and none of the filters Lucene provides seems suitable. I think I need to write my own filter, but I don't know how.

Any suggestions?

Recommended answer

You're right to look to StopFilter for a starting point, so read the source!

Most of StopFilter's source consists of convenience methods for building the stop set. You can safely ignore all of that (unless you want to keep it around for building your keep set).

Cut all that, and StopFilter boils down to:

public final class StopFilter extends FilteringTokenFilter {

    private final CharArraySet stopWords;
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public StopFilter(Version matchVersion, TokenStream in, CharArraySet stopWords) {
        super(matchVersion, in);
        this.stopWords = stopWords;
    }

    @Override
    protected boolean accept() {
        return !stopWords.contains(termAtt.buffer(), 0, termAtt.length());
    }
}

FilteringTokenFilter is a pretty simple class to implement. The key is just the accept method. When it's called for the current term, if it returns true, the term is added to the output stream. If it returns false, the current term is discarded.

So the only thing you really need to change in StopFilter is to delete a single character, the ! negation in accept, so that it returns the opposite of what it currently does. It wouldn't hurt to change a few names here and there as well.

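import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.analysis.util.FilteringTokenFilter;
import org.apache.lucene.util.Version;

public final class KeepOnlyFilter extends FilteringTokenFilter {

    // Terms that are allowed through; everything else is dropped.
    private final CharArraySet keepWords;
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public KeepOnlyFilter(Version matchVersion, TokenStream in, CharArraySet keepWords) {
        super(matchVersion, in);
        this.keepWords = keepWords;
    }

    @Override
    protected boolean accept() {
        // Same lookup as StopFilter, without the negation.
        return keepWords.contains(termAtt.buffer(), 0, termAtt.length());
    }
}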
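
To see why flipping accept is enough, here is a minimal, library-free sketch of the same pattern. It is not Lucene code: SketchFilteringFilter, SketchStopFilter, SketchKeepOnlyFilter and KeepOnlyDemo are hypothetical names invented for illustration, plain strings stand in for Lucene's token attributes, and a HashSet stands in for CharArraySet. The only point is that the base class owns the loop and the subclass owns the yes/no decision.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified stand-in for FilteringTokenFilter (assumption: not Lucene's real class).
abstract class SketchFilteringFilter {
    // Return true to pass the term through, false to drop it.
    protected abstract boolean accept(String term);

    // Applies accept() to each token in order and collects the survivors.
    final List<String> filter(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String term : tokens) {
            if (accept(term)) {
                out.add(term);
            }
        }
        return out;
    }
}

// Drops every term that is in the set, like StopFilter.
final class SketchStopFilter extends SketchFilteringFilter {
    private final Set<String> stopWords;

    SketchStopFilter(Set<String> stopWords) {
        this.stopWords = stopWords;
    }

    @Override
    protected boolean accept(String term) {
        return !stopWords.contains(term);
    }
}

// Keeps only the terms that are in the set: the same check without the negation.
final class SketchKeepOnlyFilter extends SketchFilteringFilter {
    private final Set<String> keepWords;

    SketchKeepOnlyFilter(Set<String> keepWords) {
        this.keepWords = keepWords;
    }

    @Override
    protected boolean accept(String term) {
        return keepWords.contains(term);
    }
}

class KeepOnlyDemo {
    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("the", "index", "stores", "the", "query");
        Set<String> words = new HashSet<>(Arrays.asList("index", "query"));

        System.out.println(new SketchStopFilter(words).filter(tokens));     // [the, stores, the]
        System.out.println(new SketchKeepOnlyFilter(words).filter(tokens)); // [index, query]
    }
}
```

Given the same word set, the two sketch filters produce complementary outputs. That is the relationship between StopFilter and KeepOnlyFilter above.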
