Python NLTK Lemmatization of the word 'further' with WordNet


Question


I'm working on a lemmatizer using Python, NLTK and the WordNetLemmatizer. Here are some sample calls whose output is what I was expecting:

from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
lem = WordNetLemmatizer()
lem.lemmatize('worse', pos=wordnet.ADJ)  # here, we are specifying that 'worse' is an adjective

Output: 'bad'

lem.lemmatize('worse', pos=wordnet.ADV)  # here, we are specifying that 'worse' is an adverb

Output: 'worse'

Well, everything here is fine. The behaviour is the same with other adjectives like 'better' (another irregular form) or 'older'. (Note that the same test with 'elder' will never output 'old', but I guess WordNet is not an exhaustive list of all existing English words.)

My question comes when trying the word 'further':

lem.lemmatize('further', pos=wordnet.ADJ)  # as an adjective

Output: 'further'

lem.lemmatize('further', pos=wordnet.ADV)  # as an adverb

Output: 'far'

This is the exact opposite of the behaviour for the word 'worse'!

Can anybody explain why? Is it a bug in the WordNet synset data, or does it come from my misunderstanding of English grammar?

Please excuse me if this question has already been answered; I've searched on Google and SO, but when specifying the keyword "further" I can't find anything relevant in the mass of results, because of the popularity of this word...

Thank you in advance, Romain G.

Solution

WordNetLemmatizer uses the ._morphy function to access a word's lemmas (see http://www.nltk.org/_modules/nltk/stem/wordnet.html) and returns the possible lemma with the minimum length:

def lemmatize(self, word, pos=NOUN):
    lemmas = wordnet._morphy(word, pos)
    return min(lemmas, key=len) if lemmas else word
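The selection step above can be sketched in isolation with toy data, with no NLTK download needed; `pick_shortest` and its candidate lists are illustrative, not NLTK's own:

```python
# Minimal sketch of the selection step in lemmatize(): among the candidate
# lemmas that _morphy returns, the shortest string wins; with no candidates,
# the input word is returned unchanged. (Toy data, not real WordNet lookups.)
def pick_shortest(word, lemmas):
    return min(lemmas, key=len) if lemmas else word

print(pick_shortest('worse', ['worse', 'bad']))      # -> bad
print(pick_shortest('further', ['further', 'far']))  # -> far
print(pick_shortest('xyzzy', []))                    # -> xyzzy (fallback)
```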

The ._morphy function applies rules iteratively to get a lemma. The rules keep reducing the length of the word, substituting affixes according to MORPHOLOGICAL_SUBSTITUTIONS; it then checks whether there are other words that are shorter but otherwise the same as the reduced word:

def _morphy(self, form, pos):
    # from jordanbg:
    # Given an original string x
    # 1. Apply rules once to the input to get y1, y2, y3, etc.
    # 2. Return all that are in the database
    # 3. If there are no matches, keep applying rules until you either
    #    find a match or you can't go any further

    exceptions = self._exception_map[pos]
    substitutions = self.MORPHOLOGICAL_SUBSTITUTIONS[pos]

    def apply_rules(forms):
        return [form[:-len(old)] + new
                for form in forms
                for old, new in substitutions
                if form.endswith(old)]

    def filter_forms(forms):
        result = []
        seen = set()
        for form in forms:
            if form in self._lemma_pos_offset_map:
                if pos in self._lemma_pos_offset_map[form]:
                    if form not in seen:
                        result.append(form)
                        seen.add(form)
        return result

    # 0. Check the exception lists
    if form in exceptions:
        return filter_forms([form] + exceptions[form])

    # 1. Apply rules once to the input to get y1, y2, y3, etc.
    forms = apply_rules([form])

    # 2. Return all that are in the database (and check the original too)
    results = filter_forms([form] + forms)
    if results:
        return results

    # 3. If there are no matches, keep applying rules until we find a match
    while forms:
        forms = apply_rules(forms)
        results = filter_forms(forms)
        if results:
            return results

    # Return an empty list if we can't find anything
    return []
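To see why the rules alone cannot produce worse -> bad, the apply_rules step can be reproduced standalone. The suffix pairs below are the adjective entries of MORPHOLOGICAL_SUBSTITUTIONS as they appear in the NLTK source linked above; the database-membership filter is omitted here:

```python
# Adjective suffix substitutions (old suffix, replacement), as in NLTK's
# MORPHOLOGICAL_SUBSTITUTIONS for the ADJ part of speech.
ADJ_SUBSTITUTIONS = [('er', ''), ('est', ''), ('er', 'e'), ('est', 'e')]

def apply_rules(forms, substitutions=ADJ_SUBSTITUTIONS):
    # One round of rule application: strip a matching suffix and append the
    # replacement. filter_forms() would then keep only real WordNet entries.
    return [form[:-len(old)] + new
            for form in forms
            for old, new in substitutions
            if form.endswith(old)]

print(apply_rules(['older']))  # -> ['old', 'olde']; only 'old' survives the filter
print(apply_rules(['worse']))  # -> []: no suffix matches, so only the
                               #    exception list can map 'worse' to 'bad'
```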

However, if the word is in the list of exceptions, it will return a fixed value kept in the exception map; see _load_exception_map in http://www.nltk.org/_modules/nltk/corpus/reader/wordnet.html:

def _load_exception_map(self):
    # load the exception file data into memory
    for pos, suffix in self._FILEMAP.items():
        self._exception_map[pos] = {}
        for line in self.open('%s.exc' % suffix):
            terms = line.split()
            self._exception_map[pos][terms[0]] = terms[1:]
    self._exception_map[ADJ_SAT] = self._exception_map[ADJ]
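The parsing done by _load_exception_map can be mimicked on a small in-memory sample; each line of a .exc file is an inflected form followed by one or more lemmas, whitespace-separated:

```python
# Toy reproduction of the .exc parsing in _load_exception_map: the first
# token on each line is the inflected form, the rest are its lemmas.
sample_adj_exc = """worse bad
worst bad
worthier worthy"""

exception_map = {}
for line in sample_adj_exc.splitlines():
    terms = line.split()
    exception_map[terms[0]] = terms[1:]

print(exception_map['worse'])     # -> ['bad']
print(exception_map['worthier'])  # -> ['worthy']
```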

Going back to your example, worse -> bad and further -> far CANNOT be achieved from the rules, so they have to come from the exception lists. And since these are exception lists, there are bound to be inconsistencies.

The exception lists are kept in ~/nltk_data/corpora/wordnet/adv.exc and ~/nltk_data/corpora/wordnet/adj.exc.

From adv.exc:

best well
better well
deeper deeply
farther far
further far
harder hard
hardest hard

From adj.exc:

...
worldliest worldly
wormier wormy
wormiest wormy
worse bad
worst bad
worthier worthy
worthiest worthy
wrier wry
...
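Putting the pieces together, a toy exception-only lemmatizer (illustrative names, with a few entries copied from the excerpts above) reproduces the mirrored behaviour from the question: 'worse' appears only in the adjective exceptions and 'further' only in the adverb exceptions:

```python
# Tiny exception maps built from the adj.exc / adv.exc excerpts above.
adj_exc = {'worse': ['bad'], 'worst': ['bad']}
adv_exc = {'further': ['far'], 'farther': ['far']}

def toy_lemmatize(word, exc):
    # Exception-list step only: candidates are the word plus its mapped
    # lemmas, and the shortest candidate wins (as in lemmatize()).
    lemmas = exc.get(word)
    return min([word] + lemmas, key=len) if lemmas else word

print(toy_lemmatize('worse', adj_exc))    # -> bad
print(toy_lemmatize('worse', adv_exc))    # -> worse (no adverb exception)
print(toy_lemmatize('further', adv_exc))  # -> far
print(toy_lemmatize('further', adj_exc))  # -> further (no adjective exception)
```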
