Python and NLTK: How to analyze sentence grammar?


Problem description

I have this code, which should show the syntactic structure of the sentence according to the defined grammar. However, it returns an empty list, []. What am I missing or doing wrong?

import nltk

grammar = nltk.parse_cfg("""
S -> NP VP 
PP -> P NP
NP -> Det N | Det N PP 
VP -> V NP | VP PP
N -> 'Kim' | 'Dana' | 'everyone'
V -> 'arrived' | 'left' |'cheered'
P -> 'or' | 'and'
""")

def main():
    sent = "Kim arrived or Dana left and everyone cheered".split()
    parser = nltk.ChartParser(grammar)
    trees = parser.nbest_parse(sent)
    for tree in trees:
        print tree

if __name__ == '__main__':
    main()

Solution

Let's do some reverse engineering:

>>> import nltk
>>> grammar = nltk.parse_cfg("""
... NP -> Det N | Det N PP
... N -> 'Kim' | 'Dana' | 'everyone'
... """)
>>> sent = "Kim".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[]

It seems the rules can't even recognize the first word as an NP: both NP rules start with Det, and the grammar has no productions for Det. So let's try adding NP -> N:

>>> import nltk
>>> grammar = nltk.parse_cfg("""
... NP -> Det N | Det N PP | N
... N -> 'Kim' | 'Dana' | 'everyone'
... """)
>>> sent = "Kim".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[Tree('NP', [Tree('N', ['Kim'])])]

Now that this works, let's continue with the full grammar and the next words of the sentence, "Kim arrived" and "Kim arrived or":

>>> import nltk
>>> grammar = nltk.parse_cfg("""
... S -> NP VP
... PP -> P NP
... NP -> Det N | Det N PP | N
... VP -> V NP | VP PP
... N -> 'Kim' | 'Dana' | 'everyone'
... V -> 'arrived' | 'left' |'cheered'
... P -> 'or' | 'and'
... """)
>>> sent = "Kim arrived".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[]
>>> 
>>> sent = "Kim arrived or".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[]
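
One way to see why these inputs cannot parse, without running the parser at all, is to compute the smallest number of words each nonterminal can derive. The sketch below is plain Python and does not use NLTK; the dictionary encoding of the grammar and the helper name min_yield_lengths are assumptions made for illustration only.

```python
import math

# The asker's grammar as a plain dict: nonterminal -> list of right-hand sides.
# Quoted symbols are words, everything else is a nonterminal (illustrative encoding).
ORIGINAL = {
    "S": [["NP", "VP"]],
    "PP": [["P", "NP"]],
    "NP": [["Det", "N"], ["Det", "N", "PP"]],
    "VP": [["V", "NP"], ["VP", "PP"]],
    "N": [["'Kim'"], ["'Dana'"], ["'everyone'"]],
    "V": [["'arrived'"], ["'left'"], ["'cheered'"]],
    "P": [["'or'"], ["'and'"]],
}

# The same grammar with the NP -> N rule added.
WITH_NP_N = dict(ORIGINAL, NP=ORIGINAL["NP"] + [["N"]])


def min_yield_lengths(rules):
    """Return the fewest words each symbol can derive (math.inf if none)."""
    symbols = set(rules) | {s for rhss in rules.values() for rhs in rhss for s in rhs}
    # A word has length 1; a nonterminal starts at "no derivation known yet".
    best = {s: (1 if s.startswith("'") else math.inf) for s in symbols}
    changed = True
    while changed:  # repeat until no rule improves any estimate
        changed = False
        for lhs, rhss in rules.items():
            for rhs in rhss:
                length = sum(best[s] for s in rhs)
                if length < best[lhs]:
                    best[lhs] = length
                    changed = True
    return best


if __name__ == "__main__":
    for name, rules in (("original", ORIGINAL), ("with NP -> N", WITH_NP_N)):
        lengths = min_yield_lengths(rules)
        print(name, {k: lengths[k] for k in ("NP", "VP", "S")})
```

Under these assumptions, the original grammar cannot derive any sentence at all because Det has no productions, and with NP -> N added the shortest possible S is three words long, so the two-word input "Kim arrived" has no chance of parsing.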

It seems there is no way to get a VP with or without the P: V either requires an NP after it (VP -> V NP), or it must already be part of a VP before it can take a PP (VP -> VP PP). So let's relax the rules and use VP -> V PP instead of VP -> VP PP:

>>> import nltk
>>> grammar = nltk.parse_cfg("""
... S -> NP VP
... PP -> P NP
... NP -> Det N | Det N PP | N
... VP -> V NP | V PP
... N -> 'Kim' | 'Dana' | 'everyone'
... V -> 'arrived' | 'left' |'cheered'
... P -> 'or' | 'and'
... """)
>>> sent = "Kim arrived or Dana".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[Tree('S', [Tree('NP', [Tree('N', ['Kim'])]), Tree('VP', [Tree('V', ['arrived']), Tree('PP', [Tree('P', ['or']), Tree('NP', [Tree('N', ['Dana'])])])])])]

Okay, we are getting closer, but it seems the next word breaks the CFG rules again:

>>> import nltk
>>> grammar = nltk.parse_cfg("""
... S -> NP VP
... PP -> P NP
... NP -> Det N | Det N PP | N
... VP -> V NP | V PP
... N -> 'Kim' | 'Dana' | 'everyone'
... V -> 'arrived' | 'left' |'cheered'
... P -> 'or' | 'and'
... """)
>>> sent = "Kim arrived or Dana left".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[]
>>> sent = "Kim arrived or Dana left and".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[]
>>> 
>>> sent = "Kim arrived or Dana left and everyone".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[]
>>> 
>>> sent = "Kim arrived or Dana left and everyone cheered".split()
>>> parser = nltk.ChartParser(grammar)
>>> print parser.nbest_parse(sent)
[]

So I hope the above example shows you that trying to patch the rules to cover language phenomena one word at a time, from left to right, is hard.
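
To reproduce that dead end without NLTK installed, here is a minimal sketch of a span-based recognizer in plain Python. It is not NLTK's chart parser; the grammar encoding and the name recognizes are illustrative assumptions, and the sketch assumes a grammar without empty productions or unary cycles.

```python
from functools import lru_cache

# The patched grammar from the session above (NP -> N added, VP -> V PP).
# Quoted symbols are words; Det deliberately has no productions.
PATCHED = {
    "S": [["NP", "VP"]],
    "PP": [["P", "NP"]],
    "NP": [["Det", "N"], ["Det", "N", "PP"], ["N"]],
    "VP": [["V", "NP"], ["V", "PP"]],
    "N": [["'Kim'"], ["'Dana'"], ["'everyone'"]],
    "V": [["'arrived'"], ["'left'"], ["'cheered'"]],
    "P": [["'or'"], ["'and'"]],
}


def recognizes(rules, sentence, start="S"):
    """Return True if `start` can derive the whitespace-tokenized sentence."""
    words = tuple(sentence.split())

    @lru_cache(maxsize=None)
    def derives(symbol, lo, hi):
        # Can this single symbol derive words[lo:hi]?
        if symbol.startswith("'"):
            return hi - lo == 1 and words[lo] == symbol.strip("'")
        return any(covers(tuple(rhs), lo, hi) for rhs in rules.get(symbol, []))

    @lru_cache(maxsize=None)
    def covers(rhs, lo, hi):
        # Can this sequence of symbols derive words[lo:hi]?
        if len(rhs) == 1:
            return derives(rhs[0], lo, hi)
        # The first symbol takes at least one word and leaves at least
        # one word for each of the remaining symbols.
        return any(
            derives(rhs[0], lo, mid) and covers(rhs[1:], mid, hi)
            for mid in range(lo + 1, hi - len(rhs) + 2)
        )

    return bool(words) and derives(start, 0, len(words))


if __name__ == "__main__":
    for sentence in ("Kim arrived or Dana", "Kim arrived or Dana left"):
        print(sentence, "->", recognizes(PATCHED, sentence))
```

With this grammar, "Kim arrived or Dana" is accepted, but adding "left" is rejected: once "or Dana" has been consumed as a PP, no rule can attach a second verb.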

Instead of going from left to right and ending up with

[[[[[[[[Kim] arrived] or] Dana] left] and] everyone] cheered]

why don't you try to make more linguistically sound rules to achieve:

  1. [[[Kim arrived] or [Dana left]] and [everyone cheered]]
  2. [[Kim arrived] or [[Dana left] and [everyone cheered]]]
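
These two readings are simply the two ways of grouping three clauses joined by two conjunctions. As a small sketch in plain Python (the function name bracketings is made up for illustration and is not part of NLTK), every binary grouping can be enumerated like this:

```python
def bracketings(clauses, conjunctions):
    """Yield every binary grouping of clauses joined by conjunctions.

    conjunctions[i] is the word between clauses[i] and clauses[i + 1].
    """
    if len(clauses) == 1:
        yield "[" + clauses[0] + "]"
        return
    # Split at each conjunction in turn and group both sides recursively.
    for i in range(1, len(clauses)):
        for left in bracketings(clauses[:i], conjunctions[:i - 1]):
            for right in bracketings(clauses[i:], conjunctions[i:]):
                yield "[" + left + " " + conjunctions[i - 1] + " " + right + "]"


if __name__ == "__main__":
    clauses = ["Kim arrived", "Dana left", "everyone cheered"]
    for reading in bracketings(clauses, ["or", "and"]):
        print(reading)
```

For the three clauses of the example sentence this yields exactly the two groupings listed above, which is the ambiguity a linguistically motivated grammar should expose as two separate parse trees.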

Try this instead:

import nltk
grammar = nltk.parse_cfg("""
S -> CP | VP 
CP -> VP C VP | CP C VP | VP C CP
VP -> NP V 
NP -> 'Kim' | 'Dana' | 'everyone'
V -> 'arrived' | 'left' |'cheered'
C -> 'or' | 'and'
""")

print "======= Kim arrived ========="
sent = "Kim arrived".split()
parser = nltk.ChartParser(grammar)
for t in parser.nbest_parse(sent):
    print t

print "\n======= Kim arrived or Dana left ========="
sent = "Kim arrived or Dana left".split()
parser = nltk.ChartParser(grammar)
for t in parser.nbest_parse(sent):
    print t 

print "\n=== Kim arrived or Dana left and everyone cheered ===="
sent = "Kim arrived or Dana left and everyone cheered".split()
parser = nltk.ChartParser(grammar)
for t in parser.nbest_parse(sent):
    print t

[out]:

======= Kim arrived =========
(S (VP (NP Kim) (V arrived)))

======= Kim arrived or Dana left =========
(S (CP (VP (NP Kim) (V arrived)) (C or) (VP (NP Dana) (V left))))

=== Kim arrived or Dana left and everyone cheered ====
(S
  (CP
    (CP (VP (NP Kim) (V arrived)) (C or) (VP (NP Dana) (V left)))
    (C and)
    (VP (NP everyone) (V cheered))))
(S
  (CP
    (VP (NP Kim) (V arrived))
    (C or)
    (CP
      (VP (NP Dana) (V left))
      (C and)
      (VP (NP everyone) (V cheered)))))

The above solution shows that your CFG rules need to be robust enough to capture not only the full sentence but also parts of it.
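
The snippets in this answer use the Python 2 print statement and the NLTK 2 API. In NLTK 3, nltk.parse_cfg was replaced by nltk.CFG.fromstring, and nbest_parse was replaced by parse, which returns an iterator of trees. A possible port of the final grammar to Python 3, assuming NLTK 3 is installed (the helper name parse_all is introduced here for illustration), might look like this:

```python
import nltk

# NLTK 3 spelling of nltk.parse_cfg(...)
grammar = nltk.CFG.fromstring("""
S -> CP | VP
CP -> VP C VP | CP C VP | VP C CP
VP -> NP V
NP -> 'Kim' | 'Dana' | 'everyone'
V -> 'arrived' | 'left' | 'cheered'
C -> 'or' | 'and'
""")

parser = nltk.ChartParser(grammar)


def parse_all(sentence):
    """Return every parse tree for a whitespace-tokenized sentence."""
    # parse() yields trees lazily in NLTK 3, so collect them into a list.
    return list(parser.parse(sentence.split()))


if __name__ == "__main__":
    sentences = (
        "Kim arrived",
        "Kim arrived or Dana left",
        "Kim arrived or Dana left and everyone cheered",
    )
    for sentence in sentences:
        print("=======", sentence, "=======")
        for tree in parse_all(sentence):
            print(tree)
```

The grammar itself is unchanged from the answer; only the two API calls and the print syntax differ.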
