Apache Lucene doesn't filter stop words despite the usage of StopAnalyzer and StopFilter
Problem description
I have a module based on Apache Lucene 5.5 / 6.0 which retrieves keywords. Everything is working fine except one thing: Lucene doesn't filter stop words.
I tried to enable stop word filtering with two different approaches.
Approach #1:
tokenStream = new StopFilter(new ASCIIFoldingFilter(new ClassicFilter(new LowerCaseFilter(stdToken))), EnglishAnalyzer.getDefaultStopSet());
tokenStream.reset();
Approach #2:
tokenStream = new StopFilter(new ClassicFilter(new LowerCaseFilter(stdToken)), StopAnalyzer.ENGLISH_STOP_WORDS_SET);
tokenStream.reset();
The full code is available here:
http://stackoverflow.com/a/36237769/462347
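To illustrate what the filter chains above do conceptually, here is a stdlib-only Java sketch with no Lucene dependency. The tokenizer and filters are crude stand-ins for Lucene's classes, not replacements: the point is that a stop filter only removes tokens that are actually present in the stop set it is given.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Stdlib-only sketch of the pipeline: tokenize -> lowercase -> drop stop words.
// (Illustration only; real Lucene TokenStreams are consumed via incrementToken().)
public class StopChainSketch {
    public static List<String> analyze(String text, Set<String> stopSet) {
        List<String> out = new ArrayList<>();
        for (String token : text.split("\\s+")) {       // crude whitespace tokenizer
            String t = token.toLowerCase(Locale.ROOT);  // LowerCaseFilter analogue
            if (!t.isEmpty() && !stopSet.contains(t)) { // StopFilter analogue
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stop = Set.of("the", "of", "a");
        // "the", "of", "a" are removed because they are in the stop set
        System.out.println(analyze("The anatomy of a search engine", stop));
    }
}
```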
My questions:

- Why doesn't Lucene filter stop words?
- How can I enable stop word filtering in Lucene 5.5 / 6.0?
The problem was that I expected the default Lucene stop word list to be much broader than it actually is.
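The default English stop set is in fact quite small. The list below is reproduced from memory of Lucene's StandardAnalyzer / StopAnalyzer source (verify it against your Lucene version); many words people expect to be filtered, such as "have" or "my", are simply not in it, so the filter was working all along:

```java
import java.util.List;
import java.util.Set;

// Lucene's default English stop set (assumed from StandardAnalyzer's source;
// check your version). Only 33 entries.
public class DefaultStopSet {
    public static final Set<String> ENGLISH_STOP_WORDS = Set.of(
        "a", "an", "and", "are", "as", "at", "be", "but", "by",
        "for", "if", "in", "into", "is", "it", "no", "not", "of",
        "on", "or", "such", "that", "the", "their", "then", "there",
        "these", "they", "this", "to", "was", "will", "with");

    public static void main(String[] args) {
        System.out.println(ENGLISH_STOP_WORDS.size());
        // Common words people expect to be filtered are absent:
        for (String w : List.of("i", "have", "my", "can", "do")) {
            System.out.println(w + " in default set: " + ENGLISH_STOP_WORDS.contains(w));
        }
    }
}
```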
Here is the code, which by default tries to load a customized stop word list and, if that fails, falls back to the standard one:
CharArraySet stopWordsSet;
try {
    // use the customized stop words list
    String stopWordsDictionary = FileUtils.readFileToString(new File(%PATH_TO_FILE%));
    stopWordsSet = WordlistLoader.getWordSet(new StringReader(stopWordsDictionary));
} catch (IOException e) {
    // catch IOException rather than only FileNotFoundException: both
    // readFileToString and getWordSet declare it, so this is needed to compile
    // use the standard stop words list
    stopWordsSet = CharArraySet.copy(StandardAnalyzer.STOP_WORDS_SET);
}
tokenStream = new StopFilter(new ASCIIFoldingFilter(new ClassicFilter(new LowerCaseFilter(stdToken))), stopWordsSet);
tokenStream.reset();
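The fallback pattern above can also be sketched without the Lucene and Commons IO dependencies. This stdlib-only version assumes a plain-text file with one stop word per line (the file name and format here are illustrative, not from the original code):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

// Stdlib-only version of the fallback logic: try to read one stop word per
// line from a custom file; on any I/O failure use the built-in default set.
public class StopWordsLoader {
    public static Set<String> load(Path customList, Set<String> fallback) {
        try {
            Set<String> set = new HashSet<>();
            for (String line : Files.readAllLines(customList, StandardCharsets.UTF_8)) {
                String w = line.trim();
                if (!w.isEmpty()) set.add(w);
            }
            return set;
        } catch (IOException e) {
            // customized list missing or unreadable: fall back to the standard one
            return fallback;
        }
    }

    public static void main(String[] args) {
        Set<String> fallback = Set.of("a", "an", "the");
        // a nonexistent path triggers the fallback branch
        System.out.println(load(Path.of("no-such-stopwords.txt"), fallback));
    }
}
```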