WordNet lemmatization and POS tagging in Python

Problem description

I want to use the WordNet lemmatizer in Python. I have learned that its default POS tag is NOUN, and that it does not return the correct lemma for a verb unless the POS tag is explicitly specified as VERB.

My question is: what is the best way to perform this lemmatization accurately?

I did the POS tagging with nltk.pos_tag, but I am lost on how to convert the Treebank POS tags into WordNet-compatible POS tags. Please help.

import nltk
from nltk.stem.wordnet import WordNetLemmatizer

lmtzr = WordNetLemmatizer()
# tokens: a list of word strings
tagged = nltk.pos_tag(tokens)

The output tags look like NN, JJ, VB, RB. How do I change these to WordNet-compatible tags?

Also, do I have to train nltk.pos_tag() with a tagged corpus first, or can I use it directly on my data?

Recommended answer

First of all, you can use nltk.pos_tag() directly without training it. The function loads a pretrained tagger from a file. You can see the file name with nltk.tag._POS_TAGGER (this attribute exists in the NLTK release this answer was written for; later releases changed the default tagger, so it may be absent there):

>>> nltk.tag._POS_TAGGER
'taggers/maxent_treebank_pos_tagger/english.pickle'

Because it was trained on the Treebank corpus, it also uses the Treebank tag set.

The following function maps Treebank tags to WordNet part-of-speech names:

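from nltk.corpus import wordnet

def get_wordnet_pos(treebank_tag):
    """Map a Treebank POS tag to a WordNet POS constant ('' if none applies)."""
    # Treebank tags share a first letter per word class: JJ*, VB*, NN*, RB*
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        # No WordNet equivalent; the caller must check for this value
        return ''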

You can then pass the return value to the lemmatizer:

>>> from nltk.corpus import wordnet
>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> lemmatizer = WordNetLemmatizer()
>>> lemmatizer.lemmatize('going', wordnet.VERB)
'go'

Check the return value before passing it to the lemmatizer, because an empty string would raise a KeyError.
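
As a rough illustration of that check, the sketch below runs a list of (word, Treebank tag) pairs through a lemmatizer and falls back to the lemmatizer's default whenever a tag has no WordNet equivalent. The names PREFIX_TO_WORDNET_POS, to_wordnet_pos and lemmatize_tagged are made up for this example and are not part of NLTK. It also differs from the function above in two ways that are assumptions of this sketch: it returns None instead of an empty string for unmapped tags, and it uses the single-letter codes 'a', 'v', 'n', 'r' directly (the values of wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV) so that the mapping itself does not need the WordNet data to be loaded.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# WordNet's single-letter POS codes, keyed by the first letter of the
# Treebank tag (JJ* -> adjective, VB* -> verb, NN* -> noun, RB* -> adverb).
PREFIX_TO_WORDNET_POS = {"J": "a", "V": "v", "N": "n", "R": "r"}


def to_wordnet_pos(treebank_tag: str) -> Optional[str]:
    """Return the WordNet POS code for a Treebank tag, or None if unmapped."""
    return PREFIX_TO_WORDNET_POS.get(treebank_tag[:1])


def lemmatize_tagged(
    tagged: Iterable[Tuple[str, str]],
    lemmatize: Callable[..., str],
) -> List[str]:
    """Lemmatize (word, treebank_tag) pairs such as the output of nltk.pos_tag.

    `lemmatize` is assumed to be a callable shaped like
    WordNetLemmatizer().lemmatize, i.e. lemmatize(word) or lemmatize(word, pos).
    """
    lemmas = []
    for word, tag in tagged:
        pos = to_wordnet_pos(tag)
        if pos is None:
            # No WordNet equivalent (determiners, prepositions, punctuation):
            # call without a POS so the lemmatizer applies its own default.
            lemmas.append(lemmatize(word))
        else:
            lemmas.append(lemmatize(word, pos))
    return lemmas
```

With NLTK installed and its tagger and WordNet data downloaded, a call would look something like lemmatize_tagged(nltk.pos_tag(tokens), WordNetLemmatizer().lemmatize). Passing the lemmatizer in as an argument is only a convenience of this sketch; it keeps the tag-handling logic separate from the NLTK data files.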
