Absolute position of leaves in NLTK tree

Question

I am trying to find the span (start index, end index) of each noun phrase in a given sentence. The following is the code I use to extract the noun phrases:

import nltk

# `a` is the input sentence as a plain string (shown further down).
sent = nltk.word_tokenize(a)
sent_pos = nltk.pos_tag(sent)
grammar = r"""
    NBAR:
        {<NN.*|JJ>*<NN.*>}  # Nouns and Adjectives, terminated with Nouns

    NP:
        {<NBAR>}
        {<NBAR><IN><NBAR>}  # Above, connected with in/of/etc...
    VP:
        {<VBD><PP>?}
        {<VBZ><PP>?}
        {<VB><PP>?}
        {<VBN><PP>?}
        {<VBG><PP>?}
        {<VBP><PP>?}
"""

cp = nltk.RegexpParser(grammar)
result = cp.parse(sent_pos)
nounPhrases = []
for subtree in result.subtrees(filter=lambda t: t.label() == 'NP'):
    # Each leaf is a (word, tag) tuple; join the words back into one phrase.
    nounPhrases.append(' '.join(word for word, tag in subtree.leaves()))

For a = "The American Civil War, also known as the War between the States or simply the Civil War, was a civil war fought from 1861 to 1865 in the United States after several Southern slave states declared their secession and formed the Confederate States of America.", the noun phrases extracted are

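['American Civil War', 'War', 'States', 'Civil War', 'civil war fought', 'United States', 'several Southern', 'states', 'secession', 'Confederate States', 'America'].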

Now I need to find the span (start and end token position) of each noun phrase. For example, the spans of the above noun phrases would be

[(1,3),(9,9),(12,12),(16,17),(21,23),....] .

I'm fairly new to NLTK and I've looked into http://www.nltk.org/_modules/nltk/tree.html. I tried to use Tree.treepositions(), but I couldn't work out how to turn those indices into absolute positions. Any help would be greatly appreciated. Thank you!

Recommended answer

There isn't any built-in function that returns the offsets of strings/tokens, as highlighted by https://github.com/nltk/nltk/issues/1214

But you can use the ngram searcher from the RIBES score implementation at https://github.com/nltk/nltk/blob/develop/nltk/translate/ribes_score.py#L123

>>> from nltk import word_tokenize
>>> from nltk.translate.ribes_score import position_of_ngram
>>> s = word_tokenize("The American Civil War, also known as the War between the States or simply the Civil War, was a civil war fought from 1861 to 1865 in the United States after several Southern slave states declared their secession and formed the Confederate States of America.")
>>> position_of_ngram(tuple('American Civil War'.split()), s)
1
>>> position_of_ngram(tuple('Confederate States of America'.split()), s)
43

(It returns the starting position of the query ngram in the token list.)
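Note that a search by ngram can only report where a matching token sequence occurs, so a phrase that appears more than once in the sentence (such as "Civil War" here) may be matched at the wrong place. An alternative is to walk the chunk tree itself and count leaves as you go. The following is a minimal sketch of that idea, not an NLTK API: `chunk_spans` is a hypothetical helper name, and `Node` is a hypothetical stand-in for `nltk.Tree` used only so the demo runs without NLTK installed. The helper assumes that subtrees expose a `label()` method and that leaves (for example `(word, tag)` tuples) do not.

```python
def chunk_spans(tree, label="NP"):
    """Return inclusive (start, end) token indices of every subtree with `label`."""
    spans = []

    def walk(node, start):
        # `start` is the index of the first leaf under `node`.
        offset = start
        for child in node:
            if hasattr(child, "label"):   # a subtree
                offset = walk(child, offset)
            else:                         # a leaf, e.g. a (word, tag) tuple
                offset += 1
        if node.label() == label and offset > start:
            spans.append((start, offset - 1))
        return offset                     # index just past the last leaf

    walk(tree, 0)
    return sorted(spans)


class Node(list):
    """Hypothetical stand-in for nltk.Tree, only to keep the demo dependency-free."""

    def __init__(self, label, children):
        super().__init__(children)
        self._label = label

    def label(self):
        return self._label


# A hand-built fragment shaped like the chunker output for the example sentence.
demo = Node("S", [
    ("The", "DT"),
    Node("NP", [Node("NBAR", [("American", "NNP"), ("Civil", "NNP"), ("War", "NNP")])]),
    (",", ","),
    ("also", "RB"),
    Node("NP", [Node("NBAR", [("War", "NNP")])]),
])

print(chunk_spans(demo))  # [(1, 3), (6, 6)]
```

With the question's code, the same helper would presumably be called as `chunk_spans(result)` on the tree returned by `cp.parse(sent_pos)`. Because the indices come from the leaf order of the tree, they refer to positions in the tokenized sentence, and repeated phrases each get their own span.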
