Extracting specific leaf value from nltk tree structure with Python

Question

I have some questions about NLTK's tree functions. I am trying to extract a certain word from a tree structure like the one given below.

from nltk import Tree

# Tree.fromstring replaces Tree.parse in NLTK 3
test = Tree.fromstring('(ROOT(SBARQ(WHADVP(WRB How))(SQ(VBP do)(NP (PRP you))(VP(VB ask)(NP(DT a)(JJ total)(NN stranger))(PRT (RP out))(PP (IN on)(NP (DT a)(NN date)))))))')

print("Input tree: ", test)
print(test.leaves())

Input tree:  (ROOT
  (SBARQ
    (WHADVP (WRB How))
    (SQ
      (VBP do)
      (NP (PRP you))
      (VP
        (VB ask)
        (NP (DT a) (JJ total) (NN stranger))
        (PRT (RP out))
        (PP (IN on) (NP (DT a) (NN date)))))))

['How', 'do', 'you', 'ask', 'a', 'total', 'stranger', 'out', 'on', 'a', 'date']

I can get a list of all the words using the leaves() function. Is there a way to get only a specific leaf? For example, I would like to get the first and last noun from the NP phrases only. The answer would be 'stranger' for the first noun and 'date' for the last noun.

Recommended answer

Although noun phrases can be nested inside other types of phrases, I believe most grammars always place nouns inside noun phrases. So your question can probably be rephrased as: how do you find the first and last nouns?

You can simply get all (word, POS tag) tuples and filter them like this:

>>> [word for word,pos in test.pos() if pos=='NN']
['stranger', 'date']

In this case there are only two, so you're done. If there were more nouns, you would just index the list at [0] and [-1].
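As a minimal sketch of that indexing step, assuming NLTK 3 (where `Tree.fromstring` builds the tree), the snippet below takes the first and last noun from the filtered list. Matching tags with `startswith("NN")` so that plural or proper-noun tags such as NNS and NNP are also counted is an assumption added here, not part of the original answer, as is the guard for a sentence without nouns.

```python
from nltk import Tree

test = Tree.fromstring(
    "(ROOT(SBARQ(WHADVP(WRB How))(SQ(VBP do)(NP (PRP you))"
    "(VP(VB ask)(NP(DT a)(JJ total)(NN stranger))(PRT (RP out))"
    "(PP (IN on)(NP (DT a)(NN date)))))))"
)

# Tree.pos() returns (word, tag) pairs in left-to-right order.
# Assumption: any tag starting with "NN" counts as a noun.
nouns = [word for word, pos in test.pos() if pos.startswith("NN")]

# Guard against a sentence with no nouns before indexing.
first_noun = nouns[0] if nouns else None
last_noun = nouns[-1] if nouns else None

print(first_noun, last_noun)
```

For the example sentence this should give 'stranger' and 'date', the same two words as the filter above.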

If you were looking for another POS that can appear in different kinds of phrases but you only wanted its uses inside a particular one, or if you had an unusual grammar that allowed nouns outside of NPs, you can do the following...

You can find the 'NP' subtrees by doing:

>>> NPs = list(test.subtrees(filter=lambda x: x.label() == 'NP'))
>>> NPs
[Tree('NP', [Tree('PRP', ['you'])]), Tree('NP', [Tree('DT', ['a']), Tree('JJ', ['total']), Tree('NN', ['stranger'])]), Tree('NP', [Tree('DT', ['a']), Tree('NN', ['date'])])]

Continuing to narrow down the subtrees, we can use this result to look for 'NN' words:

>>> NNs_inside_NPs = [list(np.subtrees(filter=lambda x: x.label() == 'NN')) for np in NPs]
>>> NNs_inside_NPs
[[], [Tree('NN', ['stranger'])], [Tree('NN', ['date'])]]

So this is a list of lists of all the 'NN's inside each 'NP' phrase. In this case there happens to be only zero or one noun in each phrase.

Now we just need to go through the 'NP's and get the leaves of the individual nouns (which really means we just want to access the 'stranger' part of Tree('NN', ['stranger'])).

>>> [noun.leaves()[0] for nouns in NNs_inside_NPs for noun in nouns]
['stranger', 'date']
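The three steps above can be folded into one small helper. This is only a sketch, assuming NLTK 3; the function name `words_in_phrases` and its parameters are hypothetical and not part of NLTK. Note that if one NP is nested inside another, a noun in the inner NP would be reported once for each enclosing NP, so duplicates are possible with other trees.

```python
from nltk import Tree


def words_in_phrases(tree, phrase_label, pos_label):
    """Hypothetical helper: words tagged pos_label inside phrase_label subtrees."""
    words = []
    # Step 1: find every subtree whose label is the phrase type (e.g. 'NP').
    for phrase in tree.subtrees(filter=lambda t: t.label() == phrase_label):
        # Step 2: inside each phrase, find the preterminals with the wanted tag.
        for tagged in phrase.subtrees(filter=lambda t: t.label() == pos_label):
            # Step 3: a preterminal such as Tree('NN', ['stranger']) has one leaf.
            words.append(tagged.leaves()[0])
    return words


test = Tree.fromstring(
    "(ROOT(SBARQ(WHADVP(WRB How))(SQ(VBP do)(NP (PRP you))"
    "(VP(VB ask)(NP(DT a)(JJ total)(NN stranger))(PRT (RP out))"
    "(PP (IN on)(NP (DT a)(NN date)))))))"
)

nouns_in_nps = words_in_phrases(test, "NP", "NN")
print(nouns_in_nps)
```

Passing other labels, for example "VP" and "VB", would restrict a different POS to a different phrase type in the same way.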
