WordNet lemmatization and POS tagging in Python
Question
I want to use the WordNet lemmatizer in Python. I have learned that the default POS tag is NOUN, and that it does not output the correct lemma for a verb unless the POS tag is explicitly specified as VERB.
My question is: what is the best way to perform this lemmatization accurately?
I did the POS tagging with nltk.pos_tag, but I am lost on how to convert the Treebank POS tags into WordNet-compatible POS tags. Please help.
import nltk
from nltk.stem.wordnet import WordNetLemmatizer
lmtzr = WordNetLemmatizer()
# tokens is assumed to be a list of word strings
tagged = nltk.pos_tag(tokens)
The output tags look like NN, JJ, VB and RB. How do I change these to WordNet-compatible tags?
Also, do I have to train nltk.pos_tag() with a tagged corpus, or can I use it directly on my data for evaluation?
Recommended answer
First of all, you can use nltk.pos_tag() directly without training it. The function loads a pretrained tagger from a file. You can see the file name with nltk.tag._POS_TAGGER (a private attribute, so it may not exist in every NLTK release):
nltk.tag._POS_TAGGER
# returns 'taggers/maxent_treebank_pos_tagger/english.pickle'
As it was trained on the Treebank corpus, it also uses the Treebank tag set.
The following function maps the Treebank tags to WordNet part-of-speech names:
from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return ''
You can then use the return value with the lemmatizer:
from nltk.corpus import wordnet
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('going', wordnet.VERB)
# returns 'go'
Check the return value before passing it to the lemmatizer, because an empty string would raise a KeyError.
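One way to combine the tag mapping with that check is sketched below. It is only an illustration: `lemmatize_tagged` is a hypothetical helper, not part of NLTK, and the WordNet POS constants are written as plain strings (`"a"`, `"v"`, `"n"`, `"r"`, the values of `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN` and `wordnet.ADV`) so the sketch runs without NLTK installed. The lemmatizer is passed in as a callable, and the choice to fall back to the callable's default POS for unmapped tags is an assumption; leaving such words unchanged would be an equally reasonable policy.

```python
# WordNet POS constants written as plain strings so the sketch runs without
# NLTK installed; they equal wordnet.ADJ, wordnet.VERB, wordnet.NOUN, wordnet.ADV.
ADJ, VERB, NOUN, ADV = "a", "v", "n", "r"


def get_wordnet_pos(treebank_tag):
    """Map a Treebank tag to a WordNet POS constant, or '' if none applies."""
    if treebank_tag.startswith("J"):
        return ADJ
    if treebank_tag.startswith("V"):
        return VERB
    if treebank_tag.startswith("N"):
        return NOUN
    if treebank_tag.startswith("R"):
        return ADV
    return ""


def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatize (word, treebank_tag) pairs with a lemmatize(word, pos) callable.

    Hypothetical helper, not part of NLTK. When a tag has no WordNet
    equivalent, the word is passed without a POS so the callable's
    default applies.
    """
    lemmas = []
    for word, tag in tagged_tokens:
        pos = get_wordnet_pos(tag)
        if pos:
            lemmas.append(lemmatize(word, pos))
        else:
            # Never pass '' to the lemmatizer; rely on its default POS instead.
            lemmas.append(lemmatize(word))
    return lemmas
```

With NLTK and its data installed, this could be called as `lemmatize_tagged(nltk.pos_tag(tokens), WordNetLemmatizer().lemmatize)`, where `tokens` is your own list of words.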