Apache Lucene doesn't filter stop words despite the usage of StopAnalyzer and StopFilter

Question


I have a module based on Apache Lucene 5.5 / 6.0 which retrieves keywords. Everything is working fine except one thing — Lucene doesn't filter stop words.

I tried to enable stop word filtering with two different approaches.

Approach #1:

tokenStream = new StopFilter(new ASCIIFoldingFilter(new ClassicFilter(new LowerCaseFilter(stdToken))), EnglishAnalyzer.getDefaultStopSet());
tokenStream.reset();

Approach #2:

tokenStream = new StopFilter(new ClassicFilter(new LowerCaseFilter(stdToken)), StopAnalyzer.ENGLISH_STOP_WORDS_SET);
tokenStream.reset();

The full code is available here:
http://stackoverflow.com/a/36237769/462347
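Conceptually, StopFilter simply drops every token that appears in the supplied stop set and passes the rest through. The following plain-Java sketch (no Lucene dependency) illustrates that behavior; the hard-coded set is only an illustrative subset of Lucene's default English stop set, not the full list:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class StopFilterSketch {

    // Illustrative subset of Lucene's default English stop set.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "the", "of", "to", "in", "is"));

    // Drop every token that appears in the stop set, as StopFilter does.
    static List<String> filter(List<String> tokens) {
        return tokens.stream()
                .filter(t -> !STOP_WORDS.contains(t))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("the", "quick", "brown", "fox");
        System.out.println(filter(tokens)); // [quick, brown, fox]
    }
}
```

Note that only tokens present in the set are removed; a word that merely "feels like" a stop word but is absent from the set survives filtering, which is exactly the effect described in the accepted answer below.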

My questions:

  1. Why doesn't Lucene filter stop words?
  2. How can I enable stop word filtering in Lucene 5.5 / 6.0?

Solution

The problem was that I expected Lucene's default stop word list to be much broader than it actually is.

Here is code that first tries to load a customized stop words list and, if that fails, falls back to the standard one:

CharArraySet stopWordsSet;

try {
    // use customized stop words list
    String stopWordsDictionary = FileUtils.readFileToString(new File(%PATH_TO_FILE%));
    stopWordsSet = WordlistLoader.getWordSet(new StringReader(stopWordsDictionary));
} catch (IOException e) { // readFileToString declares IOException; this also covers FileNotFoundException
    // use standard stop words list
    stopWordsSet = CharArraySet.copy(StandardAnalyzer.STOP_WORDS_SET);
}

tokenStream = new StopFilter(new ASCIIFoldingFilter(new ClassicFilter(new LowerCaseFilter(stdToken))), stopWordsSet);
tokenStream.reset();
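The fallback pattern above can be sketched without Lucene. Here, DEFAULT_STOP_WORDS is an illustrative stand-in for StandardAnalyzer.STOP_WORDS_SET, and readWordSet approximates the one-word-per-line parsing that WordlistLoader.getWordSet performs on a plain word list (this is a sketch, not the library's actual implementation):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class StopWordsLoader {

    // Illustrative stand-in for StandardAnalyzer.STOP_WORDS_SET.
    static final Set<String> DEFAULT_STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of"));

    // Read one lower-cased word per line, skipping blank lines.
    static Set<String> readWordSet(Reader reader) {
        Set<String> words = new HashSet<>();
        BufferedReader br = new BufferedReader(reader);
        try {
            String line;
            while ((line = br.readLine()) != null) {
                line = line.trim();
                if (!line.isEmpty()) {
                    words.add(line.toLowerCase());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    // Try the custom list first; fall back to the defaults if it cannot be read.
    static Set<String> loadStopWords(Path customList) {
        try {
            return readWordSet(Files.newBufferedReader(customList));
        } catch (IOException | UncheckedIOException e) {
            return DEFAULT_STOP_WORDS;
        }
    }

    public static void main(String[] args) {
        // A missing file falls back to the built-in default set.
        System.out.println(loadStopWords(Paths.get("missing.txt")));
    }
}
```

The design point is the same as in the answer: the custom list is authoritative when present, and failure to read it degrades gracefully to the built-in defaults instead of aborting the analysis chain.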
