Stop words and stemmer in Java


Question

I'm thinking of adding stop word removal to my similarity program and then a stemmer (Porter 1 or 2, depending on which is easiest to implement).

I read my text from files as whole lines and save them as one long string. So suppose I have two strings, e.g.:

String one = "I decided buy something from the shop.";
String two = "Nevertheless I decidedly bought something from a shop.";

Now that I have those strings:

Stemming: Can I just run the stemmer algorithm directly on the string, save the result as a String, and then continue working on the similarity like I did before adding the stemmer to the program, something like one.stem();?

Stop words: How does this work? O.o Do I just use one.replaceAll("I", ""); or is there a specific way to do this? I want to keep working with the string and end up with a string before running the similarity algorithms on it. The wiki doesn't say a lot.

Hope you can help me out! Thanks.

This is for a school-related project where I'm writing a paper on similarity between different algorithms, so I don't think I'm allowed to use Lucene or other libraries that do the work for me. Plus, I would like to try to understand how it works before I start using libraries like Lucene and co. Hope it's not too much of a bother ^^

Recommended answer

If you're not implementing this for academic reasons, you should consider using the Lucene library. Either way, it may be a useful reference. It has classes for tokenization, stop word filtering, stemming and similarity. Here's a quick example using Lucene 3.0 to remove stop words and stem an input string:

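import java.io.IOException;
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public static String removeStopWordsAndStem(String input) throws IOException {
    // Stop words are matched as written here, so "I" is listed in upper case.
    Set<String> stopWords = new HashSet<String>();
    stopWords.add("a");
    stopWords.add("I");
    stopWords.add("the");

    // Build the analysis chain: tokenize, drop stop words, then stem.
    TokenStream tokenStream = new StandardTokenizer(
            Version.LUCENE_30, new StringReader(input));
    tokenStream = new StopFilter(true, tokenStream, stopWords);
    tokenStream = new PorterStemFilter(tokenStream);

    // Join the remaining stemmed tokens back into one space-separated string.
    StringBuilder sb = new StringBuilder();
    TermAttribute termAttr = tokenStream.getAttribute(TermAttribute.class);
    while (tokenStream.incrementToken()) {
        if (sb.length() > 0) {
            sb.append(" ");
        }
        sb.append(termAttr.term());
    }
    return sb.toString();
}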

Used on your strings like this:

public static void main(String[] args) throws IOException {
    String one = "I decided buy something from the shop.";
    String two = "Nevertheless I decidedly bought something from a shop.";
    System.out.println(removeStopWordsAndStem(one));
    System.out.println(removeStopWordsAndStem(two));
}

yields this output:

decid bui someth from shop
Nevertheless decidedli bought someth from shop
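
If you have to do it without a library, here is a rough sketch of the stop word step in plain Java, not a definitive implementation. Note that one.replaceAll("I", "") is not a good fit: it treats its first argument as a regular expression and removes every "I" character, including ones inside other words. Splitting the string into tokens and checking each token against a set avoids that. The names PlainPreprocessor, preprocess and STOP_WORDS are made up for this example, the stop word list is a tiny placeholder, and unlike the Lucene example above this sketch lower-cases every token. The stemmer parameter only marks where your own Porter implementation would plug in; the identity function is used here, so nothing is actually stemmed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;

public class PlainPreprocessor {
    // Hypothetical, tiny stop word list for illustration only (all lower case).
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "i", "the"));

    // Tokenizes the input, drops stop words, applies the given stemmer to each
    // remaining token, and joins the result back into a single string.
    static String preprocess(String input, UnaryOperator<String> stemmer) {
        List<String> kept = new ArrayList<>();
        // Split on runs of non-letters, which also discards punctuation.
        for (String token : input.split("[^\\p{L}]+")) {
            if (token.isEmpty()) {
                continue; // can happen when the input starts with a non-letter
            }
            String lower = token.toLowerCase(Locale.ROOT);
            if (STOP_WORDS.contains(lower)) {
                continue; // whole-token match, so "I" inside a word is untouched
            }
            kept.add(stemmer.apply(lower));
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        String one = "I decided buy something from the shop.";
        // Identity "stemmer" as a placeholder: replace with your own stemmer.
        System.out.println(preprocess(one, UnaryOperator.identity()));
        // prints: decided buy something from shop
    }
}
```

Since the method takes a String and returns a String, you can keep feeding the result to your similarity code as before.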
