Python NLTK Lemmatization of the word 'further' with wordnet

Question
I am working on a lemmatizer using Python, NLTK and the WordNetLemmatizer. Here is some sample code that produces the output I was expecting:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
lem = WordNetLemmatizer()
lem.lemmatize('worse', pos=wordnet.ADJ)  # here, we are specifying that 'worse' is an adjective
Output: 'bad'
lem.lemmatize('worse', pos=wordnet.ADV)  # here, we are specifying that 'worse' is an adverb
Output: 'worse'
Well, everything here is fine. The behaviour is the same with other adjectives such as 'better' (an irregular form) or 'older' (note that the same test with 'elder' will never output 'old', but I guess WordNet is not an exhaustive list of every existing English word).
My problem comes when I try the word 'further':
lem.lemmatize('further', pos=wordnet.ADJ)  # as an adjective
Output: 'further'
lem.lemmatize('further', pos=wordnet.ADV)  # as an adverb
Output: 'far'
This is the exact opposite of the behaviour observed for 'worse'!
Can anybody explain why? Is it a bug in the WordNet synset data, or does it come from my misunderstanding of English grammar?
Please excuse me if this question has already been answered; I have searched on Google and SO, but when specifying the keyword "further" I find anything but what I'm looking for, because of how common the word is...
Thanks in advance, Romain G.
WordNetLemmatizer uses the ._morphy function to access a word's lemmas (from http://www.nltk.org/_modules/nltk/stem/wordnet.html) and returns the possible lemma with minimum length:
def lemmatize(self, word, pos=NOUN):
    lemmas = wordnet._morphy(word, pos)
    return min(lemmas, key=len) if lemmas else word
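The selection step can be sketched without NLTK at all. The following is a minimal stand-alone illustration; the candidate lists are hand-written here, not produced by _morphy:

```python
# Minimal sketch of lemmatize()'s final step: pick the shortest
# candidate lemma, or fall back to the input word when _morphy
# found nothing. Candidate lists below are hand-written examples.
def pick_lemma(word, lemmas):
    return min(lemmas, key=len) if lemmas else word

print(pick_lemma('worse', ['bad']))  # exception-list hit -> 'bad'
print(pick_lemma('worse', []))       # no candidates -> 'worse' unchanged
```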
The ._morphy function applies rules iteratively to obtain a lemma: the rules keep reducing the length of the word, substituting affixes according to MORPHOLOGICAL_SUBSTITUTIONS. It then checks whether there are other words that are shorter but otherwise identical to the reduced word:
def _morphy(self, form, pos):
    # from jordanbg:
    # Given an original string x
    # 1. Apply rules once to the input to get y1, y2, y3, etc.
    # 2. Return all that are in the database
    # 3. If there are no matches, keep applying rules until you either
    #    find a match or you can't go any further

    exceptions = self._exception_map[pos]
    substitutions = self.MORPHOLOGICAL_SUBSTITUTIONS[pos]

    def apply_rules(forms):
        return [form[:-len(old)] + new
                for form in forms
                for old, new in substitutions
                if form.endswith(old)]

    def filter_forms(forms):
        result = []
        seen = set()
        for form in forms:
            if form in self._lemma_pos_offset_map:
                if pos in self._lemma_pos_offset_map[form]:
                    if form not in seen:
                        result.append(form)
                        seen.add(form)
        return result

    # 0. Check the exception lists
    if form in exceptions:
        return filter_forms([form] + exceptions[form])

    # 1. Apply rules once to the input to get y1, y2, y3, etc.
    forms = apply_rules([form])

    # 2. Return all that are in the database (and check the original too)
    results = filter_forms([form] + forms)
    if results:
        return results

    # 3. If there are no matches, keep applying rules until we find a match
    while forms:
        forms = apply_rules(forms)
        results = filter_forms(forms)
        if results:
            return results

    # Return an empty list if we can't find anything
    return []
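The rule-application step above can be mimicked with a toy substitution table. The suffix pairs below are illustrative only, in the same (old, new) shape NLTK uses, but they are not the exact MORPHOLOGICAL_SUBSTITUTIONS shipped with NLTK:

```python
# Toy version of apply_rules(): strip or replace known suffixes to
# generate lemma candidates. The substitution pairs are made up
# for illustration.
substitutions = [('er', ''), ('est', ''), ('er', 'e'), ('est', 'e')]

def apply_rules(forms):
    return [f[:-len(old)] + new
            for f in forms
            for old, new in substitutions
            if f.endswith(old)]

print(apply_rules(['older']))  # -> ['old', 'olde']
```

Note that the candidates are only proposals; in the real _morphy, filter_forms then discards any candidate that is not an actual WordNet entry for the requested part of speech.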
However, if the word is in the exception list, it returns the fixed value kept in exceptions; see _load_exception_map in http://www.nltk.org/_modules/nltk/corpus/reader/wordnet.html:
def _load_exception_map(self):
    # load the exception file data into memory
    for pos, suffix in self._FILEMAP.items():
        self._exception_map[pos] = {}
        for line in self.open('%s.exc' % suffix):
            terms = line.split()
            self._exception_map[pos][terms[0]] = terms[1:]
    self._exception_map[ADJ_SAT] = self._exception_map[ADJ]
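The parsing logic is simple enough to demonstrate on a few hand-written lines in the .exc format (inflected form first, its lemmas after):

```python
# Sketch of the .exc parsing loop above, run on hand-written lines
# in the same "inflected lemma..." format as the real files.
sample_lines = ["worse bad", "further far", "best well"]
exception_map = {}
for line in sample_lines:
    terms = line.split()
    exception_map[terms[0]] = terms[1:]

print(exception_map['further'])  # -> ['far']
```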
Going back to your example: worse -> bad and further -> far CANNOT be obtained from the rules, so they must come from the exception lists. And since these are exception lists, there are bound to be inconsistencies.
The exception lists are kept in ~/nltk_data/corpora/wordnet/adj.exc and ~/nltk_data/corpora/wordnet/adv.exc.
From adv.exc:
best well
better well
deeper deeply
farther far
further far
harder hard
hardest hard
From adj.exc:
...
worldliest worldly
wormier wormy
wormiest wormy
worse bad
worst bad
worthier worthy
worthiest worthy
wrier wry
...
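The asymmetry the question observes falls out of these excerpts directly; modelling the two lists as small dictionaries makes it obvious (the entries below are copied from the adv.exc and adj.exc excerpts above):

```python
# 'further' appears only in the adverb exception list, while
# 'worse' appears only in the adjective one; entries are copied
# from the adv.exc / adj.exc excerpts above.
adv_exc = {'farther': 'far', 'further': 'far'}
adj_exc = {'worse': 'bad', 'worst': 'bad'}

# ADV lookup of 'further' hits the list -> 'far';
# ADJ lookup misses, so the word comes back unchanged.
print(adv_exc.get('further', 'further'))  # -> 'far'
print(adj_exc.get('further', 'further'))  # -> 'further'

# 'worse' behaves the opposite way, as the question noticed.
print(adj_exc.get('worse', 'worse'))      # -> 'bad'
print(adv_exc.get('worse', 'worse'))      # -> 'worse'
```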