In NLTK pos_tag, why is "hello" classified as a noun?

Problem description

I tried

import nltk

# Tokenize the sentence, then tag each token with its part of speech.
text = nltk.word_tokenize("hello, my name is John")
words = nltk.pos_tag(text)

for word, tag in words:
    print("%s = %s" % (word, tag))

and I got

hello = NN
, = ,
my = PRP$
name = NN
is = VBZ
John = NNP

Recommended answer

According to the Penn Treebank tagset, hello is definitely an interjection and is consistently tagged UH. The problem you're running into is that the taggers NLTK ships with were most likely trained on the freely available portion of the Wall Street Journal section of the Penn Treebank, which unfortunately for you contains zero occurrences of the word hello and only three words tagged UH (interjection). A tagger that never saw hello in training has nothing to go on but other cues such as the surrounding words, which is presumably how it ends up with NN here. If you want to tag spoken text, you'll need to train your tagger on the whole Penn Treebank, which includes something like 3 million words of spoken English.
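If you want to check a claim like this against whatever data your tagger was trained on, something along these lines may help. It is only a minimal sketch: the helper name count_word_and_tag and the tiny hand-made sample are made up for illustration and are not part of NLTK.

```python
from collections import Counter


def count_word_and_tag(tagged_words, word, tag):
    """Return (occurrences of `word`, number of tokens tagged `tag`).

    `tagged_words` is any iterable of (token, tag) pairs; the word
    comparison is case-insensitive.
    """
    word_counts = Counter()
    tag_counts = Counter()
    for token, token_tag in tagged_words:
        word_counts[token.lower()] += 1
        tag_counts[token_tag] += 1
    return word_counts[word.lower()], tag_counts[tag]


# Hypothetical hand-made sample standing in for real training data.
sample = [
    ("Stocks", "NNS"), ("fell", "VBD"), ("sharply", "RB"), (".", "."),
    ("Oh", "UH"), (",", ","), ("the", "DT"), ("market", "NN"),
    ("closed", "VBD"), (".", "."),
]

hello_count, uh_count = count_word_and_tag(sample, "hello", "UH")
print("hello occurrences:", hello_count)
print("tokens tagged UH:", uh_count)
```

With the NLTK treebank corpus downloaded, you should be able to pass nltk.corpus.treebank.tagged_words() in place of the sample to run the same count on the Penn Treebank sample that ships with NLTK.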

By the way, the NLTK taggers won't always call hello a noun: try tagging "don't hello me!" or "he said hello".
