How to add custom slangs into spaCy's norm_exceptions.py module?


Question

spaCy's documentation here provides some information about adding new slangs.


However, I'd like to know:

(1) When should I call the following function?

lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM], NORM_EXCEPTIONS, BASE_NORMS)

The typical usage of spaCy, according to the introduction guide here, is as follows:

import spacy
nlp = spacy.load('en')
# Should I call the function add_lookups(...) here?
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')

(2) When in the processing pipeline are norm exceptions handled?

I'm assuming a typical pipeline as such: tokenizer -> tagger -> parser -> ner.

Are norm exceptions handled right before the tokenizer? And also, how is the norm exceptions component organized with respect to the other pre-processing components such as stop words, lemmatizer (see full list of components here)? What comes before what?


I'm new to spaCy and any help would be appreciated. Thanks!

Solution

The norm exceptions are part of the language data, and the attribute getter (the function that takes a text and returns the norm) is initialised with the language class, e.g. English. You can see an example of this here. This all happens before the pipeline is even constructed.

The assumption here is that the norm exceptions are usually language-specific and should thus be defined in the language data, independent of the processing pipeline. Norms are also lexical attributes, so their getters live on the underlying lexeme, the context-insensitive entry in the vocabulary (as opposed to a token, which is the word in context).
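The chained-lookup idea behind `add_lookups` can be sketched in plain Python. This is a simplified illustration of the concept, not spaCy's actual implementation, and the table contents below are invented for the example:

```python
# Simplified sketch of the add_lookups idea: consult each exception
# table in order, falling back to a default getter if nothing matches.
def add_lookups(default_getter, *tables):
    def get_norm(text):
        for table in tables:
            if text in table:
                return table[text]
        return default_getter(text)
    return get_norm

# Hypothetical tables, purely for illustration.
BASE_NORMS = {"'cos": "because"}
NORM_EXCEPTIONS = {"gonna": "going to"}

get_norm = add_lookups(lambda text: text.lower(), NORM_EXCEPTIONS, BASE_NORMS)
print(get_norm("gonna"))  # "going to"  (custom exception wins)
print(get_norm("'cos"))   # "because"   (falls through to the base table)
print(get_norm("Apple"))  # "apple"     (default getter: lowercase)
```

Because this wiring is done when the language class is initialised, it is already finished by the time `spacy.load('en')` returns, which is why there is nothing to call per document.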

However, the nice thing about the token.norm_ is that it's writeable – so you can easily add a custom pipeline component that looks up the token's text in your own dictionary, and overwrites the norm if necessary:

def add_custom_norms(doc):
    # Look up each token's text in your own dictionary and, if it's
    # found, overwrite the token's norm with your custom value.
    for token in doc:
        if token.text in YOUR_NORM_DICT:
            token.norm_ = YOUR_NORM_DICT[token.text]
    return doc

# Register the component so it runs last in the pipeline.
nlp.add_pipe(add_custom_norms, last=True)
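To see what this component does without loading a model, here is a toy run with a stand-in token class (the `FakeToken` class and the dictionary contents are invented for illustration; real spaCy tokens work the same way for this purpose):

```python
# Stand-in for spaCy's Token, just to demonstrate the component's effect.
class FakeToken:
    def __init__(self, text):
        self.text = text
        self.norm_ = text.lower()  # roughly spaCy's default norm

YOUR_NORM_DICT = {"gonna": "going to", "tmrw": "tomorrow"}  # hypothetical slang table

def add_custom_norms(doc):
    for token in doc:
        if token.text in YOUR_NORM_DICT:
            token.norm_ = YOUR_NORM_DICT[token.text]
    return doc

doc = [FakeToken(t) for t in ["See", "you", "tmrw"]]
add_custom_norms(doc)
print([t.norm_ for t in doc])  # ['see', 'you', 'tomorrow']
```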

Keep in mind that the NORM attribute is also used as a feature in the model, so depending on the norms you want to add or overwrite, you might want to only apply your custom component after the tagger, parser or entity recognizer is called.

For example, by default, spaCy normalises all currency symbols to "$" to ensure that they all receive similar representations, even if one of them is less frequent in the training data. If your custom component now overwrites "€" with "Euro", this will also have an impact on the model's predictions. So you might see less accurate predictions for MONEY entities.
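As a toy illustration of that trade-off (the tables and values below are made up, not spaCy's actual data): under the default behaviour "€" and "£" receive the same feature value, while the custom override gives "€" a value the pretrained model never saw during training:

```python
DEFAULT_NORMS = {"€": "$", "£": "$"}  # pretend default: all currency symbols -> "$"
CUSTOM_NORMS = {"€": "Euro"}          # pretend custom override

def norm(text, overrides):
    # Custom overrides take precedence over the default table.
    if text in overrides:
        return overrides[text]
    return DEFAULT_NORMS.get(text, text.lower())

print(norm("€", {}))            # "$"    -- same feature value the model was trained on
print(norm("€", CUSTOM_NORMS))  # "Euro" -- unseen value; MONEY predictions may degrade
```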

If you're planning on training your own model that takes your custom norms into account, you might want to consider implementing a custom language subclass. Alternatively, if you think that the slang terms you want to add should be included in spaCy by default, you can always submit a pull request, for example to the English norm_exceptions.py.

