NLTK - Chunk grammar doesn't read commas


Question

import nltk
from nltk.chunk.util import tagstr2tree
from nltk import word_tokenize, pos_tag

text = "John Rose Center is very beautiful place and i want to go there with Barbara Palvin. Also there are stores like Adidas ,Nike ,Reebok Center."
# Tag the whitespace-separated tokens with part-of-speech labels.
tagged_text = pos_tag(text.split())

# Chunk one or more consecutive proper nouns (NNP) into an NP.
grammar = "NP:{<NNP>+}"

cp = nltk.RegexpParser(grammar)
result = cp.parse(tagged_text)

print(result)

Output:

(S
  (NP John/NNP Rose/NNP Center/NNP)
  is/VBZ
  very/RB
  beautiful/JJ
  place/NN
  and/CC
  i/NN
  want/VBP
  to/TO
  go/VB
  there/RB
  with/IN
  (NP Barbara/NNP Palvin./NNP)
  Also/RB
  there/EX
  are/VBP
  stores/NNS
  like/IN
  (NP Adidas/NNP ,Nike/NNP ,Reebok/NNP Center./NNP))

The grammar I use for chunking only matches NNP tags, but proper nouns that are separated by commas still end up in the same chunk. I want my chunks to look like this:

(S
  (NP John/NNP Rose/NNP Center/NNP)
  is/VBZ
  very/RB
  beautiful/JJ
  place/NN
  and/CC
  i/NN
  want/VBP
  to/TO
  go/VB
  there/RB
  with/IN
  (NP Barbara/NNP Palvin./NNP)
  Also/RB
  there/EX
  are/VBP
  stores/NNS
  like/IN
  (NP Adidas,/NNP)
  (NP Nike,/NNP)
  (NP Reebok/NNP Center./NNP))

What should I write in "grammar =", or can I edit the output so it looks like the tree above? As you can see, I only parse proper nouns for my named-entity project. Please help me out.

Solution

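Use word_tokenize(string) instead of string.split(). split() only breaks on whitespace, so commas and periods stay attached to the neighboring words (",Nike", "Center.") and are tagged together with them. word_tokenize emits punctuation as separate tokens, which get their own tags and therefore break up runs of NNP: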

>>> import nltk
>>> from nltk.chunk.util import tagstr2tree
>>> from nltk import word_tokenize, pos_tag
>>> text = "John Rose Center is very beautiful place and i want to go there with Barbara Palvin. Also there are stores like Adidas ,Nike ,Reebok Center."
>>> tagged_text = pos_tag(word_tokenize(text))
>>> 
>>> grammar = "NP:{<NNP>+}"
>>> 
>>> cp = nltk.RegexpParser(grammar)
>>> result = cp.parse(tagged_text)
>>> 
>>> print(result)
(S
  (NP John/NNP Rose/NNP Center/NNP)
  is/VBZ
  very/RB
  beautiful/JJ
  place/NN
  and/CC
  i/NN
  want/VBP
  to/TO
  go/VB
  there/RB
  with/IN
  (NP Barbara/NNP Palvin/NNP)
  ./.
  Also/RB
  there/EX
  are/VBP
  stores/NNS
  like/IN
  (NP Adidas/NNP)
  ,/,
  (NP Nike/NNP)
  ,/,
  (NP Reebok/NNP Center/NNP)
  ./.)
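
To see why the tokenizer matters, and how to pull the proper-noun chunks out of the resulting tree, here is a minimal sketch. It is an illustration under assumptions, not part of the original answer: it uses wordpunct_tokenize (a regex-based NLTK tokenizer that needs no downloaded models) in place of word_tokenize, and the tags are written by hand to mirror the output above rather than produced by pos_tag.

```python
import nltk
from nltk.tokenize import wordpunct_tokenize

fragment = "stores like Adidas ,Nike ,Reebok Center."

# str.split() only breaks on whitespace, so punctuation stays glued to words.
print(fragment.split())

# A punctuation-aware tokenizer emits commas and periods as separate tokens.
print(wordpunct_tokenize(fragment))

# Hand-written tags (an assumption standing in for pos_tag output),
# copied from the tagged tree shown in the answer.
tagged = [
    ("stores", "NNS"), ("like", "IN"),
    ("Adidas", "NNP"), (",", ","),
    ("Nike", "NNP"), (",", ","),
    ("Reebok", "NNP"), ("Center", "NNP"), (".", "."),
]

# Same grammar as above: one or more consecutive NNP tokens form an NP.
cp = nltk.RegexpParser("NP: {<NNP>+}")
tree = cp.parse(tagged)

# Collect the words of every NP subtree as a plain string.
entities = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
print(entities)
```

Because the commas are now tokens of their own with a non-NNP tag, the <NNP>+ pattern cannot span them, so "Adidas", "Nike", and "Reebok Center" come out as three separate chunks without any change to the grammar.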
