How does the spacy lemmatizer work?

Question

For lemmatization, spacy has lists of words: adjectives, adverbs, verbs... and also lists of exceptions: adverbs_irreg... For the regular ones, there is a set of rules.

Let's take as an example the word "wider".

As it is an adjective, the rule for lemmatization should be taken from this list:

ADJECTIVE_RULES = [
    ["er", ""],
    ["est", ""],
    ["er", "e"],
    ["est", "e"]
] 
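To make the question concrete, applying these suffix rules to "wider" generates two candidate forms; a minimal sketch of the candidate-generation step only (the selection between candidates is exactly what the question asks about):

```python
ADJECTIVE_RULES = [
    ["er", ""],
    ["est", ""],
    ["er", "e"],
    ["est", "e"],
]

def candidates(word, rules):
    # strip each matching old suffix and attach the replacement
    return [word[:len(word) - len(old)] + new
            for old, new in rules if word.endswith(old)]

print(candidates("wider", ADJECTIVE_RULES))  # ['wid', 'wide']
```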

As I understand it, the process will be like this:

1) Get the POS tag of the word to know whether it is a noun, a verb...
2) If the word is in the list of irregular cases, it is replaced directly; if not, one of the rules is applied.

Now, how is it decided to use "er" -> "e" instead of "er" -> "" to get "wide" and not "wid"?

It can be tested here.

Answer

Let's start with the class definition: https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py

It starts off by initializing 3 variables:

class Lemmatizer(object):
    @classmethod
    def load(cls, path, index=None, exc=None, rules=None):
        return cls(index or {}, exc or {}, rules or {})

    def __init__(self, index, exceptions, rules):
        self.index = index
        self.exc = exceptions
        self.rules = rules

Now, looking at the self.exc for English, we see that it points to https://github.com/explosion/spaCy/tree/develop/spacy/lang/en/lemmatizer/__init__.py, where it loads files from the directory https://github.com/explosion/spaCy/tree/master/spacy/en/lemmatizer

Most probably because declaring the strings in-code is faster than streaming strings through I/O.

Looking at them closely, they all seem to come from the original Princeton WordNet: https://wordnet.princeton.edu/man/wndb.5WN.html

Rules

Looking more closely, these rules originally come from the Morphy software: https://wordnet.princeton.edu/man/morphy.7WN.html

Additionally, spacy includes some punctuation rules that aren't from Princeton Morphy:

PUNCT_RULES = [
    ["\u201c", "\""],
    ["\u201d", "\""],
    ["\u2018", "'"],
    ["\u2019", "'"]
]
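As a quick illustration, these rules simply map curly quotation marks to their ASCII equivalents. A minimal sketch of that mapping (not spacy's actual application logic, which runs punctuation through the same rule machinery):

```python
PUNCT_RULES = [
    ["\u201c", "\""],  # left double curly quote  -> straight double quote
    ["\u201d", "\""],  # right double curly quote -> straight double quote
    ["\u2018", "'"],   # left single curly quote  -> straight single quote
    ["\u2019", "'"],   # right single curly quote -> straight single quote
]

def normalize_punct(token):
    # replace a curly-quote token with its straight-quote equivalent
    for old, new in PUNCT_RULES:
        if token == old:
            return new
    return token

print(normalize_punct("\u2019"))  # '
```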

Exceptions

As for the exceptions, they are stored in the *_irreg.py files in spacy, and they look like they also come from the Princeton WordNet.

This is evident if we look at a mirror of the original WordNet .exc (exclusion) files (e.g. https://github.com/extjwnl/extjwnl-data-wn21/blob/master/src/main/resources/net/sf/extjwnl/data/wordnet/wn21/adj.exc). And if you download the wordnet package from nltk, we see that it's the same list:

alvas@ubi:~/nltk_data/corpora/wordnet$ ls
adj.exc       cntlist.rev  data.noun  index.adv    index.verb  noun.exc
adv.exc       data.adj     data.verb  index.noun   lexnames    README
citation.bib  data.adv     index.adj  index.sense  LICENSE     verb.exc
alvas@ubi:~/nltk_data/corpora/wordnet$ wc -l adj.exc 
1490 adj.exc

Index

If we look at the spacy lemmatizer's index, we see that it also comes from WordNet, e.g. https://github.com/explosion/spaCy/tree/develop/spacy/lang/en/lemmatizer/_adjectives.py, and the re-distributed copy of wordnet in nltk:

alvas@ubi:~/nltk_data/corpora/wordnet$ head -n40 data.adj 

  1 This software and database is being provided to you, the LICENSEE, by  
  2 Princeton University under the following license.  By obtaining, using  
  3 and/or copying this software and database, you agree that you have  
  4 read, understood, and will comply with these terms and conditions.:  
  5   
  6 Permission to use, copy, modify and distribute this software and  
  7 database and its documentation for any purpose and without fee or  
  8 royalty is hereby granted, provided that you agree to comply with  
  9 the following copyright notice and statements, including the disclaimer,  
  10 and that the same appear on ALL copies of the software, database and  
  11 documentation, including modifications that you make for internal  
  12 use or for distribution.  
  13   
  14 WordNet 3.0 Copyright 2006 by Princeton University.  All rights reserved.  
  15   
  16 THIS SOFTWARE AND DATABASE IS PROVIDED "AS IS" AND PRINCETON  
  17 UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR  
  18 IMPLIED.  BY WAY OF EXAMPLE, BUT NOT LIMITATION, PRINCETON  
  19 UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES OF MERCHANT-  
  20 ABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE  
  21 OF THE LICENSED SOFTWARE, DATABASE OR DOCUMENTATION WILL NOT  
  22 INFRINGE ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR  
  23 OTHER RIGHTS.  
  24   
  25 The name of Princeton University or Princeton may not be used in  
  26 advertising or publicity pertaining to distribution of the software  
  27 and/or database.  Title to copyright in this software, database and  
  28 any associated documentation shall at all times remain with  
  29 Princeton University and LICENSEE agrees to preserve same.  
00001740 00 a 01 able 0 005 = 05200169 n 0000 = 05616246 n 0000 + 05616246 n 0101 + 05200169 n 0101 ! 00002098 a 0101 | (usually followed by `to') having the necessary means or skill or know-how or authority to do something; "able to swim"; "she was able to program her computer"; "we were at last able to buy a car"; "able to get a grant for the project"  
00002098 00 a 01 unable 0 002 = 05200169 n 0000 ! 00001740 a 0101 | (usually followed by `to') not having the necessary means or skill or know-how; "unable to get to town without a car"; "unable to obtain funds"  
00002312 00 a 02 abaxial 0 dorsal 4 002 ;c 06037666 n 0000 ! 00002527 a 0101 | facing away from the axis of an organ or organism; "the abaxial surface of a leaf is the underside or side facing away from the stem"  
00002527 00 a 02 adaxial 0 ventral 4 002 ;c 06037666 n 0000 ! 00002312 a 0101 | nearest to or facing toward the axis of an organ or organism; "the upper side of a leaf is known as the adaxial surface"  
00002730 00 a 01 acroscopic 0 002 ;c 06066555 n 0000 ! 00002843 a 0101 | facing or on the side toward the apex  
00002843 00 a 01 basiscopic 0 002 ;c 06066555 n 0000 ! 00002730 a 0101 | facing or on the side toward the base  
00002956 00 a 02 abducent 0 abducting 0 002 ;c 06080522 n 0000 ! 00003131 a 0101 | especially of muscles; drawing away from the midline of the body or from an adjacent part  
00003131 00 a 03 adducent 0 adductive 0 adducting 0 003 ;c 06080522 n 0000 + 01449236 v 0201 ! 00002956 a 0101 | especially of muscles; bringing together or drawing toward the midline of the body or toward an adjacent part  
00003356 00 a 01 nascent 0 005 + 07320302 n 0103 ! 00003939 a 0101 & 00003553 a 0000 & 00003700 a 0000 & 00003829 a 0000 |  being born or beginning; "the nascent chicks"; "a nascent insurgency"   
00003553 00 s 02 emergent 0 emerging 0 003 & 00003356 a 0000 + 02625016 v 0102 + 00050693 n 0101 | coming into existence; "an emergent republic"  
00003700 00 s 01 dissilient 0 002 & 00003356 a 0000 + 07434782 n 0101 | bursting open with force, as do some ripe seed vessels  


On the basis that the dictionary, exceptions, and rules that the spacy lemmatizer uses are largely from Princeton WordNet and its Morphy software, we can move on to see the actual implementation of how spacy applies the rules using the index and exceptions.

We go back to https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py

The main action comes from the lemmatize function rather than the Lemmatizer class:

def lemmatize(string, index, exceptions, rules):
    string = string.lower()
    forms = []
    # TODO: Is this correct? See discussion in Issue #435.
    #if string in index:
    #    forms.append(string)
    forms.extend(exceptions.get(string, []))
    oov_forms = []
    for old, new in rules:
        if string.endswith(old):
            form = string[:len(string) - len(old)] + new
            if not form:
                pass
            elif form in index or not form.isalpha():
                forms.append(form)
            else:
                oov_forms.append(form)
    if not forms:
        forms.extend(oov_forms)
    if not forms:
        forms.append(string)
    return set(forms)
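Plugging toy data into this function illustrates the answer to the original question: both "wid" (from "er" -> "") and "wide" (from "er" -> "e") are generated, but only "wide" is in the index, so "wid" lands in oov_forms and is discarded. The function is reproduced below so the snippet runs standalone; the toy index and exceptions are made up for illustration:

```python
def lemmatize(string, index, exceptions, rules):
    string = string.lower()
    forms = []
    forms.extend(exceptions.get(string, []))  # irregular forms win outright
    oov_forms = []
    for old, new in rules:
        if string.endswith(old):
            form = string[:len(string) - len(old)] + new
            if not form:
                pass
            elif form in index or not form.isalpha():
                forms.append(form)        # attested in the index -> keep
            else:
                oov_forms.append(form)    # rule fired but form is unattested
    if not forms:
        forms.extend(oov_forms)
    if not forms:
        forms.append(string)              # fall back to the surface form
    return set(forms)

index = {"wide", "nice"}            # toy stand-in for the WordNet-derived index
exceptions = {"worse": ["bad"]}     # toy irregular forms
rules = [["er", ""], ["est", ""], ["er", "e"], ["est", "e"]]

print(lemmatize("wider", index, exceptions, rules))      # {'wide'}
print(lemmatize("worse", index, exceptions, rules))      # {'bad'}
print(lemmatize("alvations", index, exceptions, rules))  # {'alvations'}
```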

Why is the lemmatize method outside of the Lemmatizer class?

That I'm not exactly sure, but perhaps it's to ensure that the lemmatization function can be called without a class instance. Given that @staticmethod and @classmethod exist, though, perhaps there are other considerations as to why the function and the class have been decoupled.

Comparing spacy's lemmatize() function against the morphy() function in nltk (which originally comes from http://blog.osteele.com/2004/04/pywordnet-20/, created more than a decade ago), the main processes in Oliver Steele's Python port of the WordNet morphy are:

  1. Check the exception lists
  2. Apply the rules once to the input to get y1, y2, y3, etc.
  3. Return everything that is in the database (and check the original too)
  4. If there are no matches, keep applying the rules until a match is found
  5. Return an empty list if nothing is found
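Those steps can be sketched roughly as follows. This is a hedged re-implementation of the flow as described, not nltk's actual code; `morphy_sketch`, the toy rules, and the toy database are all made up for illustration:

```python
def morphy_sketch(word, exceptions, rules, database):
    # 1. check the exception lists first
    if word in exceptions:
        return list(exceptions[word])
    candidates, seen = [word], {word}
    while candidates:
        # 3. return whatever is in the database
        #    (the original word is checked on the first pass)
        found = [c for c in candidates if c in database]
        if found:
            return found
        # 2./4. apply the rules (again) to get y1, y2, y3, ...
        next_round = []
        for cand in candidates:
            for old, new in rules:
                if cand.endswith(old):
                    form = cand[:len(cand) - len(old)] + new
                    if form and form not in seen:
                        seen.add(form)
                        next_round.append(form)
        candidates = next_round
    return []  # 5. nothing found

print(morphy_sketch("churches", {}, [["es", ""], ["s", ""]], {"church"}))  # ['church']
```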

For spacy, it's possibly still under development, given the TODO at https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py#L76

But the general process seems to be:

  1. Look up the exceptions; get the lemma from the exception list if the word is in it.
  2. Apply the rules.
  3. Save the forms that are in the index lists.
  4. If there is no lemma from steps 1-3, just keep track of the out-of-vocabulary (OOV) words and append the raw string to the lemma forms.
  5. Return the lemma forms.

In terms of OOV handling, spacy returns the original string if no lemmatized form is found; in that respect, the nltk implementation of morphy does the same, e.g.

>>> from nltk.stem import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> wnl.lemmatize('alvations')
'alvations'

Checking for infinitive before lemmatization

Possibly another point of difference is how morphy and spacy decide what POS to assign to the word. In that respect, spacy puts some linguistic rules in the Lemmatizer() to decide whether a word is the base form, and it skips the lemmatization entirely if the word is already in the infinitive form (is_base_form()). This saves quite a bit of time if lemmatization is to be done for all words in a corpus and quite a chunk of them are infinitives (already the lemma form).
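A minimal sketch of what such a base-form check could look like. This is a hypothetical simplification, not spacy's actual is_base_form(); the feature names loosely follow Universal Dependencies conventions:

```python
def looks_like_base_form(univ_pos, morphology):
    """Hypothetical base-form check: return True when the morphological
    features already mark the base form, so lemmatization can be skipped."""
    if univ_pos == "VERB" and morphology.get("VerbForm") == "Inf":
        return True  # e.g. "run" in "to run" -- already the lemma
    if univ_pos == "NOUN" and morphology.get("Number") == "Sing":
        return True  # singular nouns are usually already lemmas
    if univ_pos == "ADJ" and morphology.get("Degree") == "Pos":
        return True  # positive degree, e.g. "wide" rather than "wider"
    return False

print(looks_like_base_form("ADJ", {"Degree": "Pos"}))  # True
print(looks_like_base_form("ADJ", {"Degree": "Cmp"}))  # False
```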

But that's possible in spacy because it allows the lemmatizer to access the POS tag, which is tied closely to some morphological rules. For morphy, although it's possible to figure out some morphology using the fine-grained PTB POS tags, it still takes some effort to sort them out to know which forms are infinitive.

Generally, the 3 primary signals of morphological features need to be teased out in the POS tag:

  • Person
  • Number
  • Gender

SpaCy did make changes to their lemmatizer after the initial answer (12 May 17). I think the purpose was to make the lemmatization faster, without look-ups and rules processing.

So they pre-lemmatize words and leave them in a lookup hash-table to make retrieval O(1) for the words that they have pre-lemmatized: https://github.com/explosion/spaCy/blob/master/spacy/lang/en/lemmatizer/lookup.py
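The idea can be sketched as a plain dict lookup with a surface-form fallback; the entries below are toy examples, not spacy's actual table:

```python
# Toy lookup table; spacy's real lookup.py holds a much larger dict.
LOOKUP = {"wider": "wide", "widest": "wide", "mice": "mouse"}

def lookup_lemmatize(word):
    # O(1) average-case retrieval; unknown words fall back to themselves,
    # matching the OOV behaviour discussed above
    return LOOKUP.get(word, word)

print(lookup_lemmatize("mice"))       # mouse
print(lookup_lemmatize("alvations"))  # alvations
```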

Also, in an effort to unify the lemmatizers across languages, the lemmatizer is now located at https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py#L92

But the underlying lemmatization steps discussed above are still relevant to the current spacy version (4d2d7d586608ddc0bcb2857fb3c2d0d4c151ebfc).

I guess now that we know it works with linguistic rules and all, the other question is "are there any non-rule-based methods for lemmatization?"

But before even answering that, "What exactly is a lemma?" might be the better question to ask.
