Extract Nouns from Text (Java)


Question

Does anyone know the easiest way to extract only nouns from a body of text?

I've heard about the TreeTagger tool and I tried giving it a shot but couldn't get it to work for some reason.

Any suggestions?

Thanks,
Phil

Edit:

import org.annolab.tt4j.*;

TreeTaggerWrapper tt = new TreeTaggerWrapper();

try {
    tt.setModel("/Nouns/english.par");

    tt.setHandler(new TokenHandler() {
        void token(String token, String pos, String lemma) {
            System.out.println(token + "\t" + pos + "\t" + lemma);
        }
    });
    tt.process(words); // words = list of words
} finally {
    tt.destroy();
}

That is my code; the language is English. I am getting the error: "The type new TokenHandler(){} must implement the inherited abstract method TokenHandler.token". Am I doing something wrong?

Recommended answer

First you will have to tokenize your text. This may seem trivial (splitting at whitespace may work for you), but done properly it is harder. Then you have to decide what counts as a noun. Does "the car park" contain one noun (car park), two nouns (car, park), or one noun (park) and one adjective (car)? This is a hard problem, but again you may be able to get by without solving it.
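To illustrate why whitespace splitting is only a starting point, here is a minimal sketch that compares it with the JDK's own word-boundary iterator. The class and method names (TokenizeSketch, splitOnWhitespace, splitOnWordBoundaries) are made up for this example, and neither method is a substitute for the tokenizer that ships with an NLP toolkit.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of any toolkit mentioned in this thread.
class TokenizeSketch {

    // Naive approach: split on runs of whitespace. Punctuation stays glued to words.
    static List<String> splitOnWhitespace(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // Slightly better: walk the JDK word boundaries and keep only the
    // segments that contain at least one letter or digit.
    static List<String> splitOnWordBoundaries(String text) {
        List<String> tokens = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(segment);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "I saw the car park, then left.";
        System.out.println(splitOnWhitespace(text));     // "park," and "left." keep their punctuation
        System.out.println(splitOnWordBoundaries(text)); // punctuation-only segments are dropped
    }
}
```

Neither variant decides whether "car park" is one token or two; that choice is left to whatever comes next in the pipeline.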

Does "I saw the xyzzy" identify a noun not in a dictionary? The word "the" probably identifies xyzzy as a noun.
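The idea that context can flag an unknown word as a noun can be sketched very crudely: treat the token after a determiner as a noun candidate. This is only an illustration with made-up names (DeterminerHeuristic, candidatesAfterDeterminer) and a three-word determiner list chosen for the example; it is not how a real tagger works, and it gets phrases like "the big dog" wrong by proposing "big".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration of using context instead of a dictionary.
class DeterminerHeuristic {

    // Tiny, deliberately incomplete determiner list (an assumption of this sketch).
    static final Set<String> DETERMINERS = new HashSet<>(Arrays.asList("the", "a", "an"));

    // Returns every token that directly follows a determiner.
    static List<String> candidatesAfterDeterminer(List<String> tokens) {
        List<String> candidates = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            String current = tokens.get(i).toLowerCase(Locale.ROOT);
            if (DETERMINERS.contains(current)) {
                candidates.add(tokens.get(i + 1));
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        // "xyzzy" is in no dictionary, yet its position after "the" marks it as a candidate.
        System.out.println(candidatesAfterDeterminer(Arrays.asList("I", "saw", "the", "xyzzy")));
    }
}
```

A statistical tagger uses the same kind of contextual evidence, only with far more of it, which is why unknown words do not necessarily defeat it.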

Where are the nouns in "time flies like an arrow"? Compare with "fruit flies like a banana" (thanks to Groucho Marx).

We use the Brown tagger (Java) (http://en.wikipedia.org/wiki/Brown_Corpus) in the OpenNLP toolkit (opennlp.tools.lang.english.PosTagger; opennlp.tools.postag.POSDictionary on http://opennlp.sourceforge.net/) to find nouns in normal English and I'd recommend starting with that - it does most of your thinking for you. Otherwise look at any of the POSTaggers (http://en.wikipedia.org/wiki/POS_tagger) or (http://www-nlp.stanford.edu/links/statnlp.html#Taggers).


In part-of-speech tagging by computer, it is typical to distinguish from 50 to 150 separate parts of speech for English, for example, NN for singular common nouns, NNS for plural common nouns, NP for singular proper nouns (see the POS tags used in the Brown Corpus).
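Once a tagger has labeled each token, picking out the nouns is just a filter over the tags. Here is a minimal sketch, assuming Brown-style tags like the ones quoted above (NN, NNS, NP). The NounFilter and TaggedToken names are hypothetical, the tags in main are assigned by hand rather than produced by a real tagger, and the prefix test is a simplification (the Penn Treebank tag set, for instance, writes proper nouns as NNP, which the "NN" prefix also happens to cover).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical post-processing step; the tagger itself is out of scope here.
class NounFilter {

    // One token together with the part-of-speech tag a tagger assigned to it.
    static final class TaggedToken {
        final String token;
        final String tag;

        TaggedToken(String token, String tag) {
            this.token = token;
            this.tag = tag;
        }
    }

    // Assumption: every noun tag starts with "NN" (common) or "NP" (proper).
    static boolean isNounTag(String tag) {
        return tag.startsWith("NN") || tag.startsWith("NP");
    }

    // Keeps the tokens whose tag marks them as nouns, in their original order.
    static List<String> extractNouns(List<TaggedToken> tagged) {
        List<String> nouns = new ArrayList<>();
        for (TaggedToken t : tagged) {
            if (isNounTag(t.tag)) {
                nouns.add(t.token);
            }
        }
        return nouns;
    }

    public static void main(String[] args) {
        // Hand-tagged input for "the car park is full" (tags chosen for illustration).
        List<TaggedToken> tagged = Arrays.asList(
                new TaggedToken("the", "AT"),
                new TaggedToken("car", "NN"),
                new TaggedToken("park", "NN"),
                new TaggedToken("is", "BEZ"),
                new TaggedToken("full", "JJ"));
        System.out.println(extractNouns(tagged));
    }
}
```

Whether "car" and "park" should then be merged into one compound noun is a separate decision made after the filter.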

There is a very full list of NLP toolkits in http://en.wikipedia.org/wiki/Natural_language_processing_toolkits. I would strongly suggest you use one of those rather than trying to match against Wordnet or other collections.
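On the compile error in the question's edit: a likely cause, assuming tt4j declares its handler as a generic interface TokenHandler<O> with a method token(O token, String pos, String lemma), is that the raw type new TokenHandler() { ... } leaves the method to implement as token(Object, String, String). A method taking String as its first parameter then overloads rather than overrides it. Separately, interface methods are implicitly public, so the implementation has to be declared public as well. The sketch below defines a local stand-in interface with that assumed shape so it compiles without tt4j; HandlerSketch and collectLines are made-up names.

```java
import java.util.ArrayList;
import java.util.List;

// Self-contained illustration; the nested interface only mimics the assumed tt4j shape.
class HandlerSketch {

    // Stand-in for org.annolab.tt4j.TokenHandler (assumed to be generic in O).
    interface TokenHandler<O> {
        void token(O token, String pos, String lemma);
    }

    // Feeds rows of {token, pos, lemma} through a correctly typed handler
    // and returns the tab-separated lines it produced.
    static List<String> collectLines(String[][] rows) {
        final List<String> lines = new ArrayList<>();

        // The type argument <String> makes token(String, String, String) the
        // method to implement, and "public" matches the interface's visibility.
        TokenHandler<String> handler = new TokenHandler<String>() {
            @Override
            public void token(String token, String pos, String lemma) {
                lines.add(token + "\t" + pos + "\t" + lemma);
            }
        };

        for (String[] row : rows) {
            handler.token(row[0], row[1], row[2]);
        }
        return lines;
    }

    public static void main(String[] args) {
        String[][] rows = { { "parks", "NNS", "park" }, { "car", "NN", "car" } };
        for (String line : collectLines(rows)) {
            System.out.println(line);
        }
    }
}
```

If the assumption holds, the same two changes would apply to the question's code: parameterize both the wrapper and the handler with String (TreeTaggerWrapper<String>, TokenHandler<String>) and declare token as public.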
