spaCy nlp pipeline order of operations


Question

Does anyone have a chronological list of the operations performed by

import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)

I can see the pipeline components with nlp.pipe_names

['tagger', 'parser', 'ner']

and an alphabetical list of factory operations with nlp.factories

{'merge_entities': <function spacy.language.Language.<lambda>>,
 'merge_noun_chunks': <function spacy.language.Language.<lambda>>,
 'ner': <function spacy.language.Language.<lambda>>,
 'parser': <function spacy.language.Language.<lambda>>,
 'sbd': <function spacy.language.Language.<lambda>>,
 'sentencizer': <function spacy.language.Language.<lambda>>,
 'similarity': <function spacy.language.Language.<lambda>>,
 'tagger': <function spacy.language.Language.<lambda>>,
 'tensorizer': <function spacy.language.Language.<lambda>>,
 'textcat': <function spacy.language.Language.<lambda>>,
 'tokenizer': <function spacy.language.Language.<lambda>>}

but I can't figure out when the lemmatizer is invoked. Lemmatization has to happen after tokenization and POS tagging, and it still runs with the parser and ner disabled. The spaCy pipeline docs don't mention it at all. Thanks!

Recommended answer

The answer to your question is more complicated than I originally thought, so I will explain it in detail.

spaCy lemmatization is usually performed with a lookup table. That means it is independent of the pipeline components and happens before the pipeline runs. However, English and Greek are set up so that rule-based lemmatization can be performed when a POS tag is available. So if the tagger is enabled, spaCy can use the POS tag to find the best lemma for the word given its tag. In this case, lemmatization happens just after the tagger pipeline component.

In short, if the tagger is disabled, we follow a static lemmatization procedure based on a lookup table that maps words to their lemmas, and lemmatization happens before any pipeline component. In contrast, when the tagger is enabled, the lemmatization procedure is rule-based and depends on the POS tag, so it happens after the tagger. To repeat, this applies only to languages that support rule-based lemmatization, such as English and Greek.
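To make the two code paths concrete, here is a minimal toy sketch in plain Python. It does not use spaCy and is not spaCy's implementation: the tables `LOOKUP`, `EXCEPTIONS`, `SUFFIX_RULES` and the functions `lookup_lemma`, `rule_lemma`, `toy_tagger` and `run_pipeline` are all hypothetical names invented for illustration. It only shows the ordering described above: a lookup lemma is assigned first, and a tagger, if present, lets a POS-dependent rule replace it.

```python
# Toy illustration only: these tables and rules are invented for this example
# and are far smaller than anything a real library ships.
LOOKUP = {"those": "that", "are": "be", "words": "word"}

# Hypothetical per-POS exceptions and suffix rules.
EXCEPTIONS = {"VERB": {"are": "be"}}
SUFFIX_RULES = {"NOUN": [("s", "")]}


def lookup_lemma(word):
    """Context-free lemmatization: one table, no POS tag needed."""
    return LOOKUP.get(word.lower(), word.lower())


def rule_lemma(word, pos):
    """POS-dependent lemmatization: exceptions first, then suffix rules."""
    word = word.lower()
    if word in EXCEPTIONS.get(pos, {}):
        return EXCEPTIONS[pos][word]
    for suffix, replacement in SUFFIX_RULES.get(pos, []):
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


def toy_tagger(word):
    """Stand-in for a statistical tagger: a hard-coded tag per word."""
    tags = {"those": "DET", "are": "VERB", "random": "ADJ", "words": "NOUN"}
    return tags.get(word.lower(), "X")


def run_pipeline(text, tagger=None):
    """Tokenize and assign lookup lemmas, then let the tagger (if any) refine them."""
    tokens = [{"text": w, "lemma": lookup_lemma(w)} for w in text.split()]
    if tagger is not None:
        for token in tokens:
            token["pos"] = tagger(token["text"])
            token["lemma"] = rule_lemma(token["text"], token["pos"])
    return [token["lemma"] for token in tokens]


print(run_pipeline("those are random words"))              # tagger disabled
print(run_pipeline("those are random words", toy_tagger))  # tagger enabled
```

With the tagger disabled, the toy lookup table maps "those" to "that"; with the toy tagger enabled, "those" is tagged as a determiner, no rule applies, and it stays "those". The toy data was chosen so that this mirrors the difference in the answer's own outputs.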

Code example:

import spacy
nlp = spacy.load('en')
nlp.remove_pipe('parser')
# uncommenting the following line removes the tagger, so lemmatization falls back to the lookup table
# nlp.remove_pipe('tagger')
nlp.remove_pipe('ner')
doc = nlp('those are random words')
for token in doc:
    print(token.lemma_)

Output with the line commented out (tagger enabled, rule-based lemmatization): those be random word

Output with the line uncommented (tagger removed, lookup-based lemmatization): that be random word

I hope that clears things up.
