Tokenizer, Stop Word Removal, Stemming in Java

Question

I am looking for a class or method that takes a long string of several hundred words, tokenizes it, removes the stop words, and stems the remaining words for use in an IR system.

For example:

"The big fat cat, said 'your funniest guy i know' to the kangaroo..."

The tokenizer would remove the punctuation and return an ArrayList of words.

The stop word remover would remove words like "the", "to", etc.

The stemmer would reduce each word to its 'root'; for example, 'funniest' would become 'funny'.

Many thanks in advance.

Solution

AFAIK, Lucene can do what you want. With StandardAnalyzer and StopAnalyzer you can do the stop word removal. In combination with the Lucene contrib-snowball project (which includes work from Snowball), you can do the stemming too.

For stemming, though, also consider the answer to this question: "Stemming algorithm that produces real words".
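
To make the three stages concrete, here is a minimal, dependency-free sketch in plain Java. It is not Lucene's API: the class name `TextPipeline`, its methods, the tiny stop list, and the suffix list are all hypothetical and chosen only for illustration. In particular, the "stemmer" is a crude suffix stripper, not the Porter or Snowball algorithm, so in a real IR system you would likely swap each stage for the corresponding Lucene component.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class TextPipeline {
    // Hypothetical, deliberately tiny stop list; real stop lists are much longer.
    private static final Set<String> STOP_WORDS = Set.of("the", "to", "a", "an", "and", "of");

    // Hypothetical suffix list for the crude stemmer below (checked in this order).
    private static final String[] SUFFIXES = {"ing", "est", "ed", "s"};

    // Lower-cases the text and splits on runs of non-letter, non-digit characters,
    // which discards punctuation and whitespace.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Keeps only the tokens that are not in the stop list.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t)) {
                kept.add(t);
            }
        }
        return kept;
    }

    // Crude suffix stripping (an assumption for illustration, NOT Porter/Snowball):
    // removes the first matching suffix if at least three characters remain.
    static String stem(String word) {
        for (String suffix : SUFFIXES) {
            if (word.length() > suffix.length() + 2 && word.endsWith(suffix)) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Runs the three stages in order: tokenize, remove stop words, stem.
    static List<String> process(String text) {
        List<String> stems = new ArrayList<>();
        for (String t : removeStopWords(tokenize(text))) {
            stems.add(stem(t));
        }
        return stems;
    }

    public static void main(String[] args) {
        System.out.println(process(
            "The big fat cat, said 'your funniest guy i know' to the kangaroo..."));
    }
}
```

With this sketch, "funniest" comes out as "funni" rather than "funny". Stems that are not real words are typical of suffix-stripping stemmers, which is why the linked question about stemmers that produce real words is worth a look if you need readable output.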
