How does spacy lemmatizer work?


Question


For lemmatization, spacy has lists of words: adjectives, adverbs, verbs... and also lists of exceptions: adverbs_irreg... For the regular ones, there is a set of rules

Let's take as example the word "wider"

As it is an adjective, the rule for lemmatization should be taken from this list:

ADJECTIVE_RULES = [
    ["er", ""],
    ["est", ""],
    ["er", "e"],
    ["est", "e"]
] 

As I understand it, the process is like this:

1) Get the POS tag of the word to know whether it is a noun, a verb, etc.
2) If the word is in the list of irregular cases, it is replaced directly; if not, one of the rules is applied.

Now, how is it decided to use "er" -> "e" instead of "er" -> "" to get "wide" and not "wid"?
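
Both "er" rules match "wider", so applying them naively yields two candidates; a quick illustration of the ambiguity, using ADJECTIVE_RULES as above:

word = 'wider'
for old, new in ADJECTIVE_RULES:
    if word.endswith(old):
        print((old, new), '->', word[:len(word) - len(old)] + new)
# ('er', '') -> wid
# ('er', 'e') -> wide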

Here it can be tested.
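
A minimal sketch of testing this with spacy itself (the model name en_core_web_sm is an assumption and must be installed separately; older versions used 'en'):

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('The river is wider here')
print([(token.text, token.lemma_) for token in doc])
# 'wider' should come back as 'wide', not 'wid'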

Solution

Let's start with the class definition: https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py

Class

It starts off with initializing 3 variables:

class Lemmatizer(object):
    @classmethod
    def load(cls, path, index=None, exc=None, rules=None):
        return cls(index or {}, exc or {}, rules or {})

    def __init__(self, index, exceptions, rules):
        self.index = index
        self.exc = exceptions
        self.rules = rules
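
A toy sketch of how those variables get populated (the tables below are made up for illustration; the real ones come from the WordNet-derived files discussed below):

lemmatizer = Lemmatizer.load(None,
                             index={'wide'},
                             exc={'worse': ['bad']},
                             rules=[['er', 'e']])
print(lemmatizer.index)  # {'wide'}
print(lemmatizer.exc)    # {'worse': ['bad']}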

Now, looking at the self.exc for English, we see that it points to https://github.com/explosion/spaCy/tree/develop/spacy/lang/en/lemmatizer/init.py, where it loads files from the directory https://github.com/explosion/spaCy/tree/master/spacy/en/lemmatizer

Why doesn't spacy just read a file?

Most probably because declaring the strings in-code is faster than streaming them through I/O.


Where do the index, exceptions and rules come from?

Looking at it closely, they all seem to come from the original Princeton WordNet https://wordnet.princeton.edu/man/wndb.5WN.html

Rules

Looking at it even closer, the rules in https://github.com/explosion/spaCy/tree/develop/spacy/lang/en/lemmatizer/_lemma_rules.py are similar to the _morphy rules from nltk https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1749

And these rules originally come from the Morphy software https://wordnet.princeton.edu/man/morphy.7WN.html
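
For comparison, nltk exposes Morphy's suffix rules directly; a quick check (attribute name as in nltk's WordNet corpus reader):

from nltk.corpus import wordnet as wn

# Morphy's suffix-replacement rules for adjectives
print(wn.MORPHOLOGICAL_SUBSTITUTIONS[wn.ADJ])
# [('er', ''), ('est', ''), ('er', 'e'), ('est', 'e')]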

Additionally, spacy has included some punctuation rules that aren't from Princeton Morphy:

PUNCT_RULES = [
    ["“", "\""],
    ["”", "\""],
    ["\u2018", "'"],
    ["\u2019", "'"]
]

Exceptions

As for the exceptions, they are stored in the *_irreg.py files in spacy, and they look like they also come from the Princeton WordNet.

It is evident if we look at a mirror of the original WordNet .exc (exclusion) files (e.g. https://github.com/extjwnl/extjwnl-data-wn21/blob/master/src/main/resources/net/sf/extjwnl/data/wordnet/wn21/adj.exc) and at the wordnet package downloadable from nltk: it's the same list:

alvas@ubi:~/nltk_data/corpora/wordnet$ ls
adj.exc       cntlist.rev  data.noun  index.adv    index.verb  noun.exc
adv.exc       data.adj     data.verb  index.noun   lexnames    README
citation.bib  data.adv     index.adj  index.sense  LICENSE     verb.exc
alvas@ubi:~/nltk_data/corpora/wordnet$ wc -l adj.exc 
1490 adj.exc
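
Each line of such an .exc file is an inflected form followed by its lemma(s). A minimal sketch of parsing it into an exceptions dict, assuming the wordnet corpus sits at nltk's default location:

import nltk

path = str(nltk.data.find('corpora/wordnet/adj.exc'))
exceptions = {}
with open(path) as fin:
    for line in fin:
        inflected, *lemmas = line.split()
        exceptions[inflected] = lemmas

print(exceptions['better'])  # ['good']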

Index

If we look at the spacy lemmatizer's index, we see that it also comes from Wordnet, e.g. https://github.com/explosion/spaCy/tree/develop/spacy/lang/en/lemmatizer/_adjectives.py and the re-distributed copy of wordnet in nltk:

alvas@ubi:~/nltk_data/corpora/wordnet$ head -n40 data.adj 

  1 This software and database is being provided to you, the LICENSEE, by  
  2 Princeton University under the following license.  By obtaining, using  
  3 and/or copying this software and database, you agree that you have  
  4 read, understood, and will comply with these terms and conditions.:  
  5   
  6 Permission to use, copy, modify and distribute this software and  
  7 database and its documentation for any purpose and without fee or  
  8 royalty is hereby granted, provided that you agree to comply with  
  9 the following copyright notice and statements, including the disclaimer,  
  10 and that the same appear on ALL copies of the software, database and  
  11 documentation, including modifications that you make for internal  
  12 use or for distribution.  
  13   
  14 WordNet 3.0 Copyright 2006 by Princeton University.  All rights reserved.  
  15   
  16 THIS SOFTWARE AND DATABASE IS PROVIDED "AS IS" AND PRINCETON  
  17 UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR  
  18 IMPLIED.  BY WAY OF EXAMPLE, BUT NOT LIMITATION, PRINCETON  
  19 UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES OF MERCHANT-  
  20 ABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE  
  21 OF THE LICENSED SOFTWARE, DATABASE OR DOCUMENTATION WILL NOT  
  22 INFRINGE ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR  
  23 OTHER RIGHTS.  
  24   
  25 The name of Princeton University or Princeton may not be used in  
  26 advertising or publicity pertaining to distribution of the software  
  27 and/or database.  Title to copyright in this software, database and  
  28 any associated documentation shall at all times remain with  
  29 Princeton University and LICENSEE agrees to preserve same.  
00001740 00 a 01 able 0 005 = 05200169 n 0000 = 05616246 n 0000 + 05616246 n 0101 + 05200169 n 0101 ! 00002098 a 0101 | (usually followed by `to') having the necessary means or skill or know-how or authority to do something; "able to swim"; "she was able to program her computer"; "we were at last able to buy a car"; "able to get a grant for the project"  
00002098 00 a 01 unable 0 002 = 05200169 n 0000 ! 00001740 a 0101 | (usually followed by `to') not having the necessary means or skill or know-how; "unable to get to town without a car"; "unable to obtain funds"  
00002312 00 a 02 abaxial 0 dorsal 4 002 ;c 06037666 n 0000 ! 00002527 a 0101 | facing away from the axis of an organ or organism; "the abaxial surface of a leaf is the underside or side facing away from the stem"  
00002527 00 a 02 adaxial 0 ventral 4 002 ;c 06037666 n 0000 ! 00002312 a 0101 | nearest to or facing toward the axis of an organ or organism; "the upper side of a leaf is known as the adaxial surface"  
00002730 00 a 01 acroscopic 0 002 ;c 06066555 n 0000 ! 00002843 a 0101 | facing or on the side toward the apex  
00002843 00 a 01 basiscopic 0 002 ;c 06066555 n 0000 ! 00002730 a 0101 | facing or on the side toward the base  
00002956 00 a 02 abducent 0 abducting 0 002 ;c 06080522 n 0000 ! 00003131 a 0101 | especially of muscles; drawing away from the midline of the body or from an adjacent part  
00003131 00 a 03 adducent 0 adductive 0 adducting 0 003 ;c 06080522 n 0000 + 01449236 v 0201 ! 00002956 a 0101 | especially of muscles; bringing together or drawing toward the midline of the body or toward an adjacent part  
00003356 00 a 01 nascent 0 005 + 07320302 n 0103 ! 00003939 a 0101 & 00003553 a 0000 & 00003700 a 0000 & 00003829 a 0000 |  being born or beginning; "the nascent chicks"; "a nascent insurgency"   
00003553 00 s 02 emergent 0 emerging 0 003 & 00003356 a 0000 + 02625016 v 0102 + 00050693 n 0101 | coming into existence; "an emergent republic"  
00003700 00 s 01 dissilient 0 002 & 00003356 a 0000 + 07434782 n 0101 | bursting open with force, as do some ripe seed vessels  
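
A rough sketch of rebuilding such an adjective index from the WordNet copy redistributed with nltk (an illustration, not spacy's actual build script):

from nltk.corpus import wordnet as wn

# Collect every adjective lemma name, analogous to spacy's _adjectives.py
adjectives = {lemma.name().lower()
              for synset in wn.all_synsets(pos=wn.ADJ)
              for lemma in synset.lemmas()}

print('wide' in adjectives)  # True
print('wid' in adjectives)   # False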


On the basis that the dictionary, exceptions and rules that the spacy lemmatizer uses are largely from Princeton WordNet and its Morphy software, we can move on to see the actual implementation of how spacy applies the rules using the index and exceptions.

We go back to https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py

The main action comes from the function rather than the Lemmatizer class:

def lemmatize(string, index, exceptions, rules):
    string = string.lower()
    forms = []
    # TODO: Is this correct? See discussion in Issue #435.
    #if string in index:
    #    forms.append(string)
    forms.extend(exceptions.get(string, []))
    oov_forms = []
    for old, new in rules:
        if string.endswith(old):
            form = string[:len(string) - len(old)] + new
            if not form:
                pass
            elif form in index or not form.isalpha():
                forms.append(form)
            else:
                oov_forms.append(form)
    if not forms:
        forms.extend(oov_forms)
    if not forms:
        forms.append(string)
    return set(forms)
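
This is what answers the original "wider" question. A toy walk-through of the function above (the index here is a made-up stand-in for the WordNet-derived one): both "er" rules fire, but "wid" is not in the index, so it lands in oov_forms and is discarded once the in-index "wide" exists.

index = {'wide', 'wild'}   # toy stand-in for the WordNet index
exceptions = {}            # 'wider' is regular, no exception entry
rules = [['er', ''], ['est', ''], ['er', 'e'], ['est', 'e']]

print(lemmatize('wider', index, exceptions, rules))
# {'wide'}  -- 'wid' was generated too, but rejected as out-of-vocabulary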

Why is the lemmatize method outside of the Lemmatizer class?

That I'm not exactly sure about, but perhaps it's to ensure that the lemmatization function can be called outside of a class instance. Then again, given that @staticmethod and @classmethod exist, there may be other considerations as to why the function and the class have been decoupled.

Morphy vs Spacy

Comparing spacy's lemmatize() function against the morphy() function in nltk (which originally comes from Oliver Steele's Python port of the WordNet morphy, http://blog.osteele.com/2004/04/pywordnet-20/, created more than a decade ago), the main processes in morphy are:

  1. Check the exception lists
  2. Apply rules once to the input to get y1, y2, y3, etc.
  3. Return all that are in the database (and check the original too)
  4. If there are no matches, keep applying rules until we find a match
  5. Return an empty list if we can't find anything
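
morphy is exposed through nltk, so the process above can be checked directly (assuming the wordnet corpus is downloaded):

>>> from nltk.corpus import wordnet as wn
>>> wn.morphy('wider', wn.ADJ)  # rules, then database check
'wide'
>>> wn.morphy('worse', wn.ADJ)  # caught by the exception list
'bad'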

For spacy, it's possibly still under development, given the TODO at https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py#L76

But the general process seems to be:

  1. Look up the exceptions; if the word is in the exception list, get the lemma from there.
  2. Apply the rules
  3. Save the ones that are in the index lists
  4. If there are no lemmas from steps 1-3, then just keep track of the out-of-vocabulary (OOV) forms and also append the original string to the lemma forms
  5. Return the lemma forms

In terms of OOV handling, spacy returns the original string if no lemmatized form is found; in that respect, the nltk implementation of morphy does the same, e.g.

>>> from nltk.stem import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> wnl.lemmatize('alvations')
'alvations'

Checking for infinitive before lemmatization

Possibly another point of difference is how morphy and spacy decide what POS to assign to the word. In that respect, spacy puts some linguistic rules in the Lemmatizer() to decide whether a word is the base form, and skips lemmatization entirely if the word is already in the infinitive form (is_base_form()). This can save quite a bit if lemmatization is done for all words in the corpus and quite a chunk of them are infinitives (already the lemma form).

But that's possible in spacy because it allows the lemmatizer to access the POS, which is tied closely to some morphological rules. For morphy, although it's possible to figure out some morphology using the fine-grained PTB POS tags, it still takes some effort to sort them out to know which forms are infinitive.
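
A hedged sketch of that idea; the checks below are illustrative only, not spacy's actual is_base_form() logic:

# Illustrative: skip lemmatization when the fine-grained tag already
# marks the base form (not spacy's real implementation).
def looks_like_base_form(univ_pos, ptb_tag):
    if univ_pos == 'verb' and ptb_tag == 'VB':   # bare infinitive
        return True
    if univ_pos == 'noun' and ptb_tag == 'NN':   # singular common noun
        return True
    if univ_pos == 'adj' and ptb_tag == 'JJ':    # positive-degree adjective
        return True
    return False

print(looks_like_base_form('verb', 'VB'))   # True  -> skip the lemmatizer
print(looks_like_base_form('verb', 'VBD'))  # False -> apply rules/exceptions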

Generally, the 3 primary signals of morphological features that need to be teased out of the POS tag are:

  • person
  • number
  • gender

Updated

SpaCy did make changes to their lemmatizer after the initial answer (12 May 17). I think the purpose was to make the lemmatization faster without look-ups and rules processing.

So they pre-lemmatize words and keep them in a lookup hash table, making retrieval O(1) for words they have pre-lemmatized: https://github.com/explosion/spaCy/blob/master/spacy/lang/en/lemmatizer/lookup.py
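
In effect, the lookup approach reduces to a dictionary get with the word itself as the fallback (a toy sketch; LOOKUP below stands in for the table in lookup.py):

LOOKUP = {'wider': 'wide', 'widest': 'wide', 'mice': 'mouse'}  # toy entries

def lookup_lemmatize(word):
    # O(1) hash lookup; OOV words fall back to the original string
    return LOOKUP.get(word, word)

print(lookup_lemmatize('wider'))      # wide
print(lookup_lemmatize('alvations'))  # alvations (OOV, unchanged)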

Also, in an effort to unify the lemmatizers across languages, the lemmatizer is now located at https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py#L92

But the underlying lemmatization steps discussed above are still relevant to the current spacy version (4d2d7d586608ddc0bcb2857fb3c2d0d4c151ebfc)


Epilogue

I guess now that we know it works with linguistic rules and all, the other question is "are there any non-rule-based methods for lemmatization?"

But before even answering that question, "What exactly is a lemma?" might be the better question to ask.
