Python NLTK pos_tag not returning the correct part-of-speech tag

Views: 80 Published: 2020/5/4 8:49:34 Tags: python machine-learning nlp nltk pos-tagger


Question

Given:

import nltk
text = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")

Running:

nltk.pos_tag(text)

I get:

[('The', 'DT'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'), ('dog', 'NN')]

This is incorrect. The tags for quick, brown, and lazy in the sentence should be:

('quick', 'JJ'), ('brown', 'JJ'), ('lazy', 'JJ')

Testing this through their online tool gives the same result; quick, brown, and lazy should be adjectives, not nouns.

Recommended answer

In short:

NLTK is not perfect. In fact, no model is perfect.

Note:

As of NLTK version 3.1, the default pos_tag function no longer uses the old MaxEnt English pickle.

It now uses PerceptronTagger:

>>> import inspect
>>> from nltk import pos_tag
>>> print(inspect.getsource(pos_tag))
def pos_tag(tokens, tagset=None):
    tagger = PerceptronTagger()
    return _pos_tag(tokens, tagset, tagger)

It is better, but still not perfect:

>>> from nltk import pos_tag
>>> pos_tag("The quick brown fox jumps over the lazy dog".split())
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
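
To see exactly where the tagger goes wrong, its output can be compared token by token with the tags you expect. The following is a minimal sketch in plain Python. The helper name tag_mismatches and the hand-written expected list are assumptions made for illustration; neither is part of NLTK.

```python
def tag_mismatches(predicted, expected):
    """Return (token, predicted_tag, expected_tag) for every disagreement."""
    if len(predicted) != len(expected):
        raise ValueError("tag sequences must have the same length")
    mismatches = []
    for (token, got), (ref_token, want) in zip(predicted, expected):
        if token != ref_token:
            raise ValueError(f"token mismatch: {token!r} vs {ref_token!r}")
        if got != want:
            mismatches.append((token, got, want))
    return mismatches


# Output of pos_tag() quoted above.
predicted = [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
             ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
             ('dog', 'NN')]

# Hand-written reference tags (an assumption for this illustration).
expected = [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'),
            ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
            ('dog', 'NN')]

errors = tag_mismatches(predicted, expected)
accuracy = 1 - len(errors) / len(expected)
print(errors)
print(f"token accuracy: {accuracy:.2%}")
```

On this one sentence the only disagreement is brown, tagged NN where JJ was expected.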

If you just want a TL;DR solution, see https://github.com/alvations/nltk_cli

In long:

Try using another tagger (see https://github.com/nltk/nltk/tree/develop/nltk/tag), e.g.:

HunPos
Stanford POS
Senna

Using the default MaxEnt POS tagger from NLTK (before version 3.1), i.e. nltk.pos_tag:

>>> from nltk import word_tokenize, pos_tag
>>> text = "The quick brown fox jumps over the lazy dog"
>>> pos_tag(word_tokenize(text))
[('The', 'DT'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'), ('dog', 'NN')]

Using the Stanford POS tagger (in newer NLTK releases the POSTagger class is named StanfordPOSTagger):

$ cd ~
$ wget http://nlp.stanford.edu/software/stanford-postagger-2015-04-20.zip
$ unzip stanford-postagger-2015-04-20.zip
$ mv stanford-postagger-2015-04-20 stanford-postagger
$ python
>>> from os.path import expanduser
>>> home = expanduser("~")
>>> from nltk.tag.stanford import POSTagger
>>> _path_to_model = home + '/stanford-postagger/models/english-bidirectional-distsim.tagger'
>>> _path_to_jar = home + '/stanford-postagger/stanford-postagger.jar'
>>> st = POSTagger(path_to_model=_path_to_model, path_to_jar=_path_to_jar)
>>> text = "The quick brown fox jumps over the lazy dog"
>>> st.tag(text.split())
[(u'The', u'DT'), (u'quick', u'JJ'), (u'brown', u'JJ'), (u'fox', u'NN'), (u'jumps', u'VBZ'), (u'over', u'IN'), (u'the', u'DT'), (u'lazy', u'JJ'), (u'dog', u'NN')]
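
Before constructing the tagger, it can help to confirm that the model and jar files are where the code expects them, since a wrong path is an easy mistake to make. The sketch below only builds and checks the paths used in the session above; stanford_tagger_paths and missing_files are hypothetical helper names, not NLTK functions.

```python
import os


def stanford_tagger_paths(base_dir):
    """Build the model and jar paths used in the session above."""
    root = os.path.join(base_dir, "stanford-postagger")
    model = os.path.join(root, "models", "english-bidirectional-distsim.tagger")
    jar = os.path.join(root, "stanford-postagger.jar")
    return model, jar


def missing_files(paths):
    """Return the subset of paths that do not point to an existing file."""
    return [path for path in paths if not os.path.isfile(path)]


model, jar = stanford_tagger_paths(os.path.expanduser("~"))
missing = missing_files([model, jar])
if missing:
    print("Missing files:", missing)
else:
    print("Both files found.")
```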

Using HunPOS (NOTE: the default encoding is ISO-8859-1, not UTF-8):

$ cd ~
$ wget https://hunpos.googlecode.com/files/hunpos-1.0-linux.tgz
$ tar zxvf hunpos-1.0-linux.tgz
$ wget https://hunpos.googlecode.com/files/en_wsj.model.gz
$ gzip -d en_wsj.model.gz 
$ mv en_wsj.model hunpos-1.0-linux/
$ python
>>> from os.path import expanduser
>>> home = expanduser("~")
>>> from nltk.tag.hunpos import HunposTagger
>>> _path_to_bin = home + '/hunpos-1.0-linux/hunpos-tag'
>>> _path_to_model = home + '/hunpos-1.0-linux/en_wsj.model'
>>> ht = HunposTagger(path_to_model=_path_to_model, path_to_bin=_path_to_bin)
>>> text = "The quick brown fox jumps over the lazy dog"
>>> ht.tag(text.split())
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
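
Because of the ISO-8859-1 default noted above, tokens containing characters outside that character set may cause trouble when passed to HunPos. A small pre-check, sketched here in plain Python, shows which tokens would be affected. The helper name split_by_latin1 and the sample sentence are made up for illustration.

```python
def split_by_latin1(tokens):
    """Separate tokens that ISO-8859-1 can represent from those it cannot."""
    representable, problematic = [], []
    for token in tokens:
        try:
            token.encode("iso-8859-1")
        except UnicodeEncodeError:
            problematic.append(token)
        else:
            representable.append(token)
    return representable, problematic


tokens = "The café charges 5 € per naïve question".split()
ok, bad = split_by_latin1(tokens)
print("representable:", ok)
print("problematic:", bad)
```

Accented Latin letters such as é and ï are part of ISO-8859-1, while the euro sign is not, so only that token is flagged here.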

Using Senna (make sure you have the latest version of NLTK; some changes were made to the API):

$ cd ~
$ wget http://ronan.collobert.com/senna/senna-v3.0.tgz
$ tar zxvf senna-v3.0.tgz
$ python
>>> from os.path import expanduser
>>> home = expanduser("~")
>>> from nltk.tag.senna import SennaTagger
>>> st = SennaTagger(home+'/senna')
>>> text = "The quick brown fox jumps over the lazy dog"
>>> st.tag(text.split())
[('The', u'DT'), ('quick', u'JJ'), ('brown', u'JJ'), ('fox', u'NN'), ('jumps', u'VBZ'), ('over', u'IN'), ('the', u'DT'), ('lazy', u'JJ'), ('dog', u'NN')]
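
The four outputs quoted above can be lined up to see where the taggers disagree. This is a sketch in plain Python with the tag sequences copied from the sessions above; the function name disagreements is illustrative only.

```python
TOKENS = "The quick brown fox jumps over the lazy dog".split()

# Tag sequences copied from the outputs quoted in this answer.
OUTPUTS = {
    "maxent":   ["DT", "NN", "NN", "NN", "NNS", "IN", "DT", "NN", "NN"],
    "stanford": ["DT", "JJ", "JJ", "NN", "VBZ", "IN", "DT", "JJ", "NN"],
    "hunpos":   ["DT", "JJ", "JJ", "NN", "NNS", "IN", "DT", "JJ", "NN"],
    "senna":    ["DT", "JJ", "JJ", "NN", "VBZ", "IN", "DT", "JJ", "NN"],
}


def disagreements(tokens, outputs):
    """List (token, {tagger: tag}) for each position where the taggers differ."""
    names = list(outputs)
    differing = []
    for index, token in enumerate(tokens):
        tags = {name: outputs[name][index] for name in names}
        if len(set(tags.values())) > 1:
            differing.append((token, tags))
    return differing


diff = disagreements(TOKENS, OUTPUTS)
for token, tags in diff:
    print(token, tags)
```

On this sentence the taggers differ only on quick, brown, jumps, and lazy, and HunPos agrees with the old MaxEnt model in tagging jumps as NNS.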

Or try building a better POS tagger:

Ngram Tagger: http://streamhacker.com/2008/11/03/part-of-speech-tagging-with-nltk-part-1/
Affix/Regex Tagger: http://streamhacker.com/2008/11/10/part-of-speech-tagging-with-nltk-part-2/
Build Your Own Brill (read the code, it's a pretty fun tagger: http://www.nltk.org/_modules/nltk/tag/brill.html), see http://streamhacker.com/2008/12/03/part-of-speech-tagging-with-nltk-part-3/
Perceptron Tagger: https://honnibal.wordpress.com/2013/09/11/a-good-part-of-speechpos-tagger-in-about-200-lines-of-python/
LDA Tagger: http://scm.io/blog/hack/2015/02/lda-intentions/
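
The n-gram and affix/regex approaches listed above rest on the idea of backoff: try the most specific tagger first and fall through to a cruder one when it has no answer. The idea can be sketched without NLTK. The lexicon, the suffix rules, and the function names below are invented for illustration and are far too small for real use.

```python
import re

# Tiny hand-written lexicon (an illustrative assumption, not trained data).
LEXICON = {
    "the": "DT",
    "over": "IN",
    "quick": "JJ",
    "brown": "JJ",
    "lazy": "JJ",
}

# Suffix rules, tried in order when the lexicon has no entry.
SUFFIX_RULES = [
    (re.compile(r"ing$"), "VBG"),
    (re.compile(r"ed$"), "VBD"),
    (re.compile(r"ly$"), "RB"),
    (re.compile(r"s$"), "NNS"),
]

DEFAULT_TAG = "NN"  # last resort when nothing else matches


def tag_token(token):
    """Tag one token: lexicon lookup, then suffix rules, then the default."""
    word = token.lower()
    if word in LEXICON:
        return LEXICON[word]
    for pattern, tag in SUFFIX_RULES:
        if pattern.search(word):
            return tag
    return DEFAULT_TAG


def tag(tokens):
    """Tag a list of tokens, returning (token, tag) pairs."""
    return [(token, tag_token(token)) for token in tokens]


result = tag("The quick brown fox jumps over the lazy dog".split())
print(result)
```

The suffix rule tags jumps as NNS, the same mistake the old MaxEnt model makes, which shows why taggers that ignore context struggle with words that can be either a noun or a verb.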

Complaints about pos_tag accuracy on Stack Overflow include:

POS tagging - NLTK thinks noun is adjective
python NLTK POS tagger not behaving as expected
How to obtain better results using NLTK pos tag
pos_tag in NLTK does not tag sentences correctly

Issues with NLTK and HunPos include:

How do I tag text files with hunpos in nltk?
Does anyone know how to configure the hunpos wrapper class on nltk?

Issues with NLTK and the Stanford POS tagger include:

trouble importing stanford pos tagger into nltk
Java Command Fails in NLTK Stanford POS Tagger
Error using Stanford POS Tagger in NLTK Python
How to improve speed with Stanford NLP Tagger and NLTK
Nltk stanford pos tagger error : Java command failed
Instantiating and using StanfordTagger within NLTK
Running Stanford POS tagger in NLTK leads to "not a valid Win32 application" on Windows
