WordNetLemmatizer not returning the right lemma unless POS is explicit - Python NLTK

Views: 230 | Posted: 2020/5/18 0:56:09 | Tags: python nlp nltk wordnet lemmatization

Question

I'm lemmatizing the TED dataset transcripts, and I noticed something strange: not all words are being lemmatized. For example,

selected -> select

which is correct.

However, involved !-> involve and horsing !-> horse unless I explicitly pass the 'v' (verb) attribute.

In the Python terminal I get the right output, but not in my code:

>>> from nltk.stem import WordNetLemmatizer
>>> from nltk.corpus import wordnet
>>> lem = WordNetLemmatizer()
>>> lem.lemmatize('involved','v')
u'involve'
>>> lem.lemmatize('horsing','v')
u'horse'

The relevant section of the code is this:

for l in LDA_Row[0].split('+'):
    w=str(l.split('*')[1])
    word=lmtzr.lemmatize(w)
    wordv=lmtzr.lemmatize(w,'v')
    print wordv, word
    # if word is not wordv:
    #   print word, wordv

The whole code is here.

What is going wrong?

Recommended answer

The lemmatizer needs the correct POS tag to be accurate. If you call WordNetLemmatizer.lemmatize() with its default settings, the tag defaults to noun; see https://github.com/nltk/nltk/blob/develop/nltk/stem/wordnet.py#L39
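
To see the effect of the default tag on a single word, it can help to ask for the lemma under every WordNet POS at once. The following is a minimal sketch, written in Python 3 syntax (unlike the Python 2 sessions in this thread); the names `lemmas_by_pos` and `demo` are hypothetical helpers, not part of NLTK, and `demo()` assumes NLTK and its WordNet data are installed.

```python
# WordNet POS letters accepted by WordNetLemmatizer.lemmatize():
# noun, verb, adjective, adverb.
WORDNET_POS = ("n", "v", "a", "r")


def lemmas_by_pos(lemmatizer, word):
    """Return {pos: lemma} for each WordNet POS letter.

    `lemmatizer` is any object with a lemmatize(word, pos) method,
    for example an nltk.stem.WordNetLemmatizer instance.
    """
    return {pos: lemmatizer.lemmatize(word, pos) for pos in WORDNET_POS}


def demo():
    """Print the default (noun) lemma next to the per-POS lemmas.

    Assumes NLTK is installed and nltk.download('wordnet') has been run.
    """
    from nltk.stem import WordNetLemmatizer

    wnl = WordNetLemmatizer()
    for word in ("selected", "involved", "horsing"):
        # wnl.lemmatize(word) is the same call as wnl.lemmatize(word, "n").
        print(word, wnl.lemmatize(word), lemmas_by_pos(wnl, word))
```

Calling `demo()` prints one line per word, which makes it easy to spot words whose lemma changes only under the 'v' tag.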

To resolve the problem, always POS-tag your data before lemmatizing, e.g.

>>> from nltk.stem import WordNetLemmatizer
>>> from nltk import pos_tag, word_tokenize
>>> wnl = WordNetLemmatizer()
>>> sent = 'This is a foo bar sentence'
>>> pos_tag(word_tokenize(sent))
[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('foo', 'NN'), ('bar', 'NN'), ('sentence', 'NN')]
>>> for word, tag in pos_tag(word_tokenize(sent)):
...     wntag = tag[0].lower()
...     wntag = wntag if wntag in ['a', 'r', 'n', 'v'] else None
...     if not wntag:
...             lemma = word
...     else:
...             lemma = wnl.lemmatize(word, wntag)
...     print lemma
... 
This
be
a
foo
bar
sentence

Note that 'is' -> 'be' only when the verb tag is passed, i.e.

>>> wnl.lemmatize('is')
'is'
>>> wnl.lemmatize('is', 'v')
u'be'

To answer the question with words from your examples:

>>> sent = 'These sentences involves some horsing around'
>>> for word, tag in pos_tag(word_tokenize(sent)):
...     wntag = tag[0].lower()
...     wntag = wntag if wntag in ['a', 'r', 'n', 'v'] else None
...     lemma = wnl.lemmatize(word, wntag) if wntag else word
...     print lemma
... 
These
sentence
involve
some
horse
around
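
One caveat about the `tag[0].lower()` shortcut used above: Penn Treebank adjective tags start with J (JJ, JJR, JJS), so they become 'j', which is not in the accepted list, and adjectives are left unlemmatized. The particle tag RP also becomes 'r' and is treated as an adverb. A more explicit mapping could look like the sketch below. It uses Python 3 syntax, and `penn_to_wordnet` and `lemmatize_tagged` are hypothetical helper names, not NLTK functions. The lemmatizer is passed in as a callable so the mapping logic does not depend on NLTK being installed.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, or None if no match."""
    if tag.startswith("J"):   # JJ, JJR, JJS -> adjective
        return "a"
    if tag.startswith("V"):   # VB, VBD, VBG, VBN, VBP, VBZ -> verb
        return "v"
    if tag.startswith("N"):   # NN, NNS, NNP, NNPS -> noun
        return "n"
    if tag.startswith("RB"):  # RB, RBR, RBS -> adverb (but not RP, particle)
        return "r"
    return None


def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatize (word, penn_tag) pairs with a lemmatize(word, pos) callable.

    Words whose tag has no WordNet equivalent are returned unchanged.
    """
    lemmas = []
    for word, tag in tagged_tokens:
        pos = penn_to_wordnet(tag)
        lemmas.append(lemmatize(word, pos) if pos else word)
    return lemmas
```

With the objects from the session above, this would be called as `lemmatize_tagged(pos_tag(word_tokenize(sent)), wnl.lemmatize)`.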

Note that WordNetLemmatizer has some quirks:

wordnet lemmatization and pos tagging in python
Python NLTK Lemmatization of the word 'further' with wordnet

Also, NLTK's default POS tagger is undergoing some major changes to improve accuracy:

Python NLTK pos_tag not returning the correct part-of-speech tag
https://github.com/nltk/nltk/issues/1110
https://github.com/nltk/nltk/pull/1143

For an out-of-the-box / off-the-shelf lemmatizer, you can take a look at https://github.com/alvations/pywsd, where I've added some try-excepts to catch words that are not in WordNet; see https://github.com/alvations/pywsd/blob/master/pywsd/utils.py#L66
