如何在Java中使用Lucene添加自定义停用词 [英] how to add custom stop words using lucene in java

查看:245
本文介绍了如何在Java中使用Lucene添加自定义停用词的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在使用lucene删除英语停用词,但我的要求是删除英语停用词和自定义停用词.以下是我使用lucene删除英语停用词的代码.

I am using lucene to remove English Stop words but my requirement is remove English stop words and Custom stop words. Below is my code to remove English stop words using lucene.

我的示例代码:

public class Stopwords_remove {
    public String removeStopWords(String string) throws IOException 
    {
        StandardAnalyzer ana = new StandardAnalyzer(Version.LUCENE_30);
        TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_36,newStringReader(string));
        StringBuilder sb = new StringBuilder();
        tokenStream = new StopFilter(Version.LUCENE_36, tokenStream, ana.STOP_WORDS_SET);
        CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
        while (tokenStream.incrementToken()) 
        {
            if (sb.length() > 0) 
            {
                sb.append(" ");
            }
            sb.append(token.toString());
        }
        return sb.toString();
    }

    public static void main(String args[]) throws IOException
    {
          String text = "this is a java project written by james.";
          Stopwords_remove stopwords = new Stopwords_remove();
          stopwords.removeStopWords(text);

    }
}

输出:java project written james.

必需的输出:java project james.

我该怎么做?

推荐答案

您可以将其他停用词添加到标准英语停用词集的副本中,也可以仅添加另一个StopFilter.喜欢:

You could add add your additional stop words into a copy of the standard english stop word set, or just add in another StopFilter. Like:

TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_36, new StringReader(string));
CharArraySet stopSet = CharArraySet.copy(Version.LUCENE_36, StandardAnalyzer.STOP_WORD_SET);
stopSet.add("add");
stopSet.add("your");
stopSet.add("stop");
stopSet.add("words");
tokenStream = new StopFilter(Version.LUCENE_36, tokenStream, stopSet);
//Or, if you just need the added stopwords in a standardanalyzer, you could just pass this stopfilter into the StandardAnalyzer...
//analyzer = new StandardAnalyzer(Version.LUCENE_36, stopSet);

或:

TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_36, new StringReader(string));
tokenStream = new StopFilter(Version.LUCENE_36, tokenStream, StandardAnalyzer.STOP_WORDS_SET);
List<String> stopWords = //your list of stop words.....
tokenStream = new StopFilter(Version.LUCENE_36, tokenStream, StopFilter.makeStopSet(Version.LUCENE_36, stopWords));

如果您尝试创建自己的分析器,则最好遵循类似

If you are trying to create your own Analyzer, you might be better served following a pattern more like the example in the Analyzer documentation.

这篇关于如何在Java中使用Lucene添加自定义停用词的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆