Tokenizer, Stop Word Removal, Stemming in Java
Problem description
I am looking for a class or method that takes a long string of several hundred words, tokenizes it, removes the stop words, and stems the remaining words for use in an IR system.
For example:
"The big fat cat, said 'your funniest guy i know' to the kangaroo..."
The tokenizer would remove the punctuation and return an ArrayList of words.
The stop word remover would remove words like "the", "to", etc.
The stemmer would reduce each word to its 'root'; for example, 'funniest' would become 'funny'.
Many thanks in advance.
AFAIK Lucene can do what you want. With StandardAnalyzer and StopAnalyzer you can do the stop word removal. In combination with the Lucene contrib-snowball project (which includes work from Snowball) you can do the stemming too.
But for stemming, also consider this answer: Stemming algorithm that produces real words
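To make the three stages from the question concrete, here is a rough plain-Java sketch that does not use Lucene at all. Everything in it is an assumption for illustration: the class name `TextPipeline`, the six-word stop list, and especially `toyStem`, which only rewrites a trailing "iest" to "y" and is not a real stemming algorithm. In practice you would probably swap in Lucene's analyzers or a Snowball stemmer for those parts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class TextPipeline {
    // Tiny illustrative stop list (an assumption); real lists are much longer.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "to", "a", "an", "and", "of"));

    // Lower-case the text and split on every run of characters
    // that are neither letters nor digits, which drops punctuation.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!raw.isEmpty()) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    // Placeholder "stemmer": handles only the "-iest" superlative ending.
    // This is NOT Porter or Snowball; it stands in for a real stemmer.
    static String toyStem(String word) {
        if (word.length() > 5 && word.endsWith("iest")) {
            return word.substring(0, word.length() - 4) + "y";
        }
        return word;
    }

    // Tokenize, drop stop words, then stem what is left.
    static List<String> process(String text) {
        List<String> result = new ArrayList<>();
        for (String token : tokenize(text)) {
            if (!STOP_WORDS.contains(token)) {
                result.add(toyStem(token));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "The big fat cat, said 'your funniest guy i know' to the kangaroo...";
        System.out.println(process(input));
    }
}
```

The stages are kept as separate methods so that the stop list or the stemmer can be replaced without touching the tokenizer.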