Extracting all nouns, adjectives form and text via Stanford parser


Problem description

I'm trying to extract all nouns and adjectives from a given text via the Stanford parser.

My current attempt uses pattern matching on the output of the Tree object's getChildrenAsList() to locate things like:

(NN paper), (NN algorithm), (NN information), ...      

and saving them in an array.

Input sentence:

In this paper we present an algorithm that extracts semantic information from an arbitrary text.

Resulting string:

[(S (PP (IN In) (NP (DT this) (NN paper))) (NP (PRP we)) (VP (VBP present) (NP (NP (DT an) (NN algorithm)) (SBAR (WHNP (WDT that)) (S (VP (VBD extracts) (NP (JJ semantic) (NN information)) (PP (IN from) (NP (DT an) (ADJP (JJ arbitrary)) (NN text)))))))) (. .))]

I'm resorting to pattern matching because I couldn't find a method in the Stanford parser that returns all words of a given class, such as nouns.

Is there a better way to extract these word classes, or does the parser provide specific methods?

import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.trees.Tree;

public class ParserDemo {
    public static void main(String[] args) {
        String str = "In this paper we present an algorithm that extracts semantic information from an arbitrary text.";
        // Load the pre-trained English PCFG grammar shipped with the parser
        LexicalizedParser lp = new LexicalizedParser("englishPCFG.ser.gz");
        Tree parseS = (Tree) lp.apply(str);
        System.out.println("parseS.getChildrenAsList(): " + parseS.getChildrenAsList());
    }
}

Answer

BTW, if all you want are parts of speech like nouns and verbs, you should just use a part of speech tagger, such as the Stanford POS tagger. It'll run a couple of orders of magnitude more quickly and be at least as accurate.

But you can do it with the parser. The method you want is taggedYield(), which returns a List<TaggedWord>. So you have:

List<TaggedWord> taggedWords = lp.apply(str).taggedYield();
for (TaggedWord tw : taggedWords) {
  if (tw.tag().startsWith("N") || tw.tag().startsWith("J")) {
    System.out.printf("%s/%s%n", tw.word(), tw.tag());
  }
}

(This method cuts a corner, knowing that all and only adjective and noun tags start with J or N in the Penn treebank tag set. You could more generally check for membership in a set of tags.)
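As a minimal sketch of that more general approach (the class name and tag set below are illustrative, not part of the Stanford API), you can check each tag for membership in an explicit set of Penn Treebank noun and adjective tags instead of relying on the J/N prefix shortcut:

```java
import java.util.Set;

public class TagFilter {
    // Penn Treebank noun and adjective tags, listed explicitly
    // rather than matched by their J/N prefixes.
    static final Set<String> KEEP = Set.of(
            "NN", "NNS", "NNP", "NNPS",  // nouns
            "JJ", "JJR", "JJS"           // adjectives
    );

    // Returns true if the tag is one we want to keep.
    static boolean keep(String tag) {
        return KEEP.contains(tag);
    }

    public static void main(String[] args) {
        System.out.println(keep("NN"));   // true
        System.out.println(keep("VBD"));  // false
    }
}
```

Inside the loop above you would then write `if (TagFilter.keep(tw.tag()))` in place of the prefix test; an explicit set is easier to extend if you later want, say, verbs as well.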

P.S. The stanford-nlp tag is best for questions about Stanford NLP tools on Stack Overflow.

