Stemming some plurals with wordnet lemmatizer doesn't work


Question

Hi, I have a problem with nltk (2.0.4): I'm trying to stem the word 'men' or 'teeth', but it doesn't seem to work. Here's my code:

############################################################################
import nltk
from nltk.corpus import wordnet as wn
from nltk.stem.wordnet import WordNetLemmatizer

lmtzr = WordNetLemmatizer()
words_raw = "men teeth"
words = nltk.word_tokenize(words_raw)
for word in words:
    # Lemmatize each token as a noun
    print('WordNet Lemmatizer NOUN: ' + lmtzr.lemmatize(word, wn.NOUN))
#############################################################################

This should print 'man' and 'tooth', but instead it prints 'men' and 'teeth'.

Is there a solution?

Recommended answer

I found the solution! I checked the file wordnet.py in the folder /usr/local/lib/python2.6/dist-packages/nltk/corpus/reader and noticed that the function _morphy(self, form, pos) returns a list of candidate stemmed words. So I tried testing _morphy:

import nltk
from nltk.corpus import wordnet as wn

words_raw = "men teeth books"
words = nltk.word_tokenize(words_raw)
for word in words:
    # _morphy is a private method; it returns a list of candidate base forms
    print(wn._morphy(word, wn.NOUN))

This program prints ['men', 'man'], ['teeth', 'tooth'] and ['book']!

The explanation for why lmtzr.lemmatize() returns only the first element of the list can perhaps be found in the lemmatize function, contained in the file wordnet.py in the folder /usr/local/lib/python2.6/dist-packages/nltk/stem:

def lemmatize(self, word, pos=NOUN):
    lemmas = wordnet._morphy(word, pos)
    return min(lemmas, key=len) if lemmas else word

I assume that it returns only the shortest word in the list, and if two words are of equal length it returns the first one, since Python's min() keeps the first minimal item on a tie. 'men' and 'man' are both 3 letters long, and 'teeth' and 'tooth' are both 5, so it returns 'men' and 'teeth' rather than 'man' and 'tooth'.
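To make the tie-breaking concrete, here is a minimal sketch that does not need nltk at all. It replays the selection logic from the quoted lemmatize() on the candidate lists reported above. The helper names shortest_first and prefer_changed are hypothetical, and prefer_changed is only one possible workaround (an assumption, not something nltk provides): it prefers a candidate that differs from the input word.

```python
# Candidate lists as reported above for wn._morphy(word, wn.NOUN).
candidates = [
    ("men", ["men", "man"]),
    ("teeth", ["teeth", "tooth"]),
    ("books", ["book"]),
]

def shortest_first(lemmas, word):
    # Same expression as the quoted lemmatize():
    # min() keeps the first item when lengths tie.
    return min(lemmas, key=len) if lemmas else word

def prefer_changed(lemmas, word):
    # Hypothetical workaround: drop candidates equal to the input,
    # then pick the shortest of what is left.
    changed = [lemma for lemma in lemmas if lemma != word]
    return min(changed, key=len) if changed else word

for word, lemmas in candidates:
    print("%s: %s vs %s" % (word,
                            shortest_first(lemmas, word),
                            prefer_changed(lemmas, word)))
# men: men vs man
# teeth: teeth vs tooth
# books: book vs book
```

With real data you would pass the list returned by wn._morphy(word, wn.NOUN) as lemmas. Keep in mind that _morphy is a private method, so its behaviour may differ between nltk versions.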
