Python NLTK POS tagger not behaving as expected

Question

I ran the pos_tag function on the text below, and it tags battery as 'RB'. Since battery is a noun here, it should be tagged 'NN'.

import nltk

nltk.pos_tag(nltk.word_tokenize('Camera picture quality was fair but speed was an issue and also battery life was not that good'))

Output:

[('Camera', 'NNP'), ('picture', 'NN'), ('quality', 'NN'), ('was', 'VBD'), ('fair', 'JJ'), ('but', 'CC'), ('speed', 'NN'), ('was', 'VBD'), ('an', 'DT'), ('issue', 'NN'), ('and', 'CC'), ('also', 'RB'), ('battery', 'RB'), ('life', 'NN'), ('was', 'VBD'), ('not', 'RB'), ('that', 'IN'), ('good', 'JJ')]

However, when I tag the same sentence with this tagger, http://cst.dk/online/pos_tagger/uk/, it tags battery as 'NN' and gives the following output:

Camera/NNP picture/NN quality/NN was/VBD fair/JJ but/CC speed/NN was/VBD an/DT issue/NN and/CC also/RB battery/NN life/NN was/VBD not/RB that/IN good/JJ

Edit:

For the sentence:

"Camera picture quality was fair but speed was an issue but battery life was not that good"

the NLTK tagger gives the following output:

[('Camera', 'NNP'), ('picture', 'NN'), ('quality', 'NN'), ('was', 'VBD'), ('fair', 'JJ'), ('but', 'CC'), ('speed', 'NN'), ('was', 'VBD'), ('an', 'DT'), ('issue', 'NN'), ('but', 'CC'), ('battery', 'NN'), ('life', 'NN'), ('was', 'VBD'), ('not', 'RB'), ('that', 'IN'), ('good', 'JJ')]

Please explain!

Recommended answer

It seems the only difference is that cst.dk tagged battery as NN, while NLTK tagged it as RB (adverb).

>>> cstdk_output = "Camera/NNP picture/NN quality/NN was/VBD fair/JJ but/CC speed/NN was/VBD an/DT issue/NN and/CC also/RB battery/NN life/NN was/VBD not/RB that/IN good/JJ"
>>> cstdk_postags = [tuple(j for j in i.split('/')) for i in cstdk_output.split()]
>>> from nltk import pos_tag
>>> sent = [i for i,j in cstdk_postags]
>>> nltk_postags = pos_tag(sent)
>>> diff = [(i[0],i[1],j[1]) for i,j in zip(cstdk_postags, nltk_postags) if i[1] != j[1]]
>>> diff
[('battery', 'NN', 'RB')]

There is not much to explain. It is a statistically trained system using Maximum Entropy (see _POS_TAGGER in http://www.nltk.org/_modules/nltk/tag.html#pos_tag), so it is bound to make mistakes. For other mistakes it makes, see "POS tagging - NLTK thinks noun is adjective".
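One way to see how much the surrounding words matter is to compare the two NLTK outputs quoted in the question. The sketch below does not call NLTK at all: the tagged lists are copied from the question, and the helper name tag_with_left_context is a hypothetical one introduced only for illustration. It shows only that the tag for battery changed when the preceding word changed. It does not show which features the tagger actually used.

```python
# Tagged outputs copied verbatim from the question (results of nltk.pos_tag).
with_and_also = [
    ('Camera', 'NNP'), ('picture', 'NN'), ('quality', 'NN'), ('was', 'VBD'),
    ('fair', 'JJ'), ('but', 'CC'), ('speed', 'NN'), ('was', 'VBD'),
    ('an', 'DT'), ('issue', 'NN'), ('and', 'CC'), ('also', 'RB'),
    ('battery', 'RB'), ('life', 'NN'), ('was', 'VBD'), ('not', 'RB'),
    ('that', 'IN'), ('good', 'JJ'),
]
with_but = [
    ('Camera', 'NNP'), ('picture', 'NN'), ('quality', 'NN'), ('was', 'VBD'),
    ('fair', 'JJ'), ('but', 'CC'), ('speed', 'NN'), ('was', 'VBD'),
    ('an', 'DT'), ('issue', 'NN'), ('but', 'CC'), ('battery', 'NN'),
    ('life', 'NN'), ('was', 'VBD'), ('not', 'RB'), ('that', 'IN'),
    ('good', 'JJ'),
]


def tag_with_left_context(tagged, word):
    """Hypothetical helper: for each occurrence of `word`, return
    (previous word, previous tag, tag assigned to `word`)."""
    results = []
    for idx, (token, tag) in enumerate(tagged):
        if token == word:
            prev_word, prev_tag = tagged[idx - 1] if idx > 0 else (None, None)
            results.append((prev_word, prev_tag, tag))
    return results


print(tag_with_left_context(with_and_also, 'battery'))  # [('also', 'RB', 'RB')]
print(tag_with_left_context(with_but, 'battery'))       # [('but', 'CC', 'NN')]
```

After the adverb also, battery was tagged RB. After the conjunction but, the same word was tagged NN.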
