Spacy - Save custom pipeline


Problem description

I'm trying to integrate a custom PhraseMatcher() component into my nlp pipeline so that I can load the custom spaCy model without having to re-add my custom components to a generic model on each load.

How can I load a spaCy model containing custom pipeline components?

I create the component, add it to my pipeline and save it with the following:

import spacy
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

class RESTCountriesComponent(object):
    name = 'countries'  # name stored in the pipeline list of meta.json

    def __init__(self, nlp, label='GPE'):
        self.countries = [u'MyCountry', u'MyOtherCountry']
        self.label = nlp.vocab.strings[label]  # hash value of the label string
        # Turn each country name into a Doc so PhraseMatcher can match on it
        patterns = [nlp(c) for c in self.countries]
        self.matcher = PhraseMatcher(nlp.vocab)
        self.matcher.add('COUNTRIES', None, *patterns)

    def __call__(self, doc):
        matches = self.matcher(doc)
        spans = []
        for _, start, end in matches:
            entity = Span(doc, start, end, label=self.label)
            spans.append(entity)
        # Add the matched spans to the existing entities
        doc.ents = list(doc.ents) + spans
        # Merge each multi-token match into a single token
        for span in spans:
            span.merge()
        return doc

nlp = English()
rest_countries = RESTCountriesComponent(nlp)
nlp.add_pipe(rest_countries)
nlp.to_disk('myNlp')

I then try to load my model with:

nlp = spacy.load('myNlp')

But get this error message:

KeyError: u"[E002] Can't find factory for 'countries'. This usually happens when spaCy calls nlp.create_pipe with a component name that's not built in - for example, when constructing the pipeline from a model's meta.json. If you're using a custom component, you can write to Language.factories['countries'] or remove it from the model meta and add it via nlp.add_pipe instead."

I can't just add my custom components to a generic pipeline in my programming environment. How can I do what I'm trying to do?

Solution

When you save out your model, spaCy will serialize all data and store a reference to your pipeline in the model's meta.json. For example: ["ner", "countries"]. When you load your model back in, spaCy will check out the meta and initialise each pipeline component by looking it up in the so-called "factories": functions that tell spaCy how to construct a pipeline component. (The reason for that is that you usually don't want your model to store and eval arbitrary code when you load it back in – at least not by default.)
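The lookup described above can be pictured with a small pure-Python model. This is only a sketch of the idea and not spaCy's actual implementation: build_pipeline, make_placeholder and the plain factories dict are hypothetical names used for illustration.

```python
# Simplified model of name-based component lookup. NOT spaCy's real code.

def make_placeholder(nlp, **cfg):
    """Hypothetical stand-in for a built-in factory such as the one for 'ner'."""
    return lambda doc: doc


# Registry mapping component names to factory functions.
factories = {"ner": make_placeholder}


def build_pipeline(meta, factories, nlp=None):
    """Create one component for each name listed in meta['pipeline']."""
    pipeline = []
    for name in meta["pipeline"]:
        if name not in factories:
            # Only the name was saved, so an unknown name cannot be rebuilt.
            raise KeyError("Can't find factory for '{}'".format(name))
        pipeline.append((name, factories[name](nlp)))
    return pipeline


meta = {"pipeline": ["ner", "countries"]}  # the names recorded in meta.json

try:
    build_pipeline(meta, factories)
except KeyError as err:
    print("Loading failed:", err)
```

Because only the component's name is serialized, the loading side has to know a factory for every name before it can rebuild the pipeline.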

In your case, spaCy is trying to look up the component name 'countries' in the factories and fails, because it's not built-in. Language.factories is a simple dictionary, though, so you can customise it and add your own entries:

from spacy.language import Language
Language.factories['countries'] = lambda nlp, **cfg: RESTCountriesComponent(nlp, **cfg)

A factory is a function that receives the shared nlp object and optional keyword arguments (config parameters). It then initialises the component and returns it. If you add the above code before you load your model, it should load as expected.
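To illustrate how a factory forwards config parameters, here is a minimal sketch that runs without spaCy installed. CountriesStub is a hypothetical stand-in for RESTCountriesComponent, and shared_nlp is just a placeholder object; with real spaCy you would pass the actual nlp object and the real component class.

```python
class CountriesStub(object):
    """Hypothetical stand-in for RESTCountriesComponent (no spaCy needed)."""
    name = "countries"

    def __init__(self, nlp, label="GPE"):
        self.nlp = nlp
        self.label = label

    def __call__(self, doc):
        return doc


def countries_factory(nlp, **cfg):
    # Forward any config values (for example label) to the constructor.
    return CountriesStub(nlp, **cfg)


shared_nlp = object()  # placeholder for the shared nlp object

default = countries_factory(shared_nlp)
custom = countries_factory(shared_nlp, label="LOC")
print(default.label, custom.label)  # prints: GPE LOC
```

A named function behaves the same as the lambda shown earlier; it is simply easier to import from a module and to test on its own.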

More advanced approaches

If you want this taken care of automatically, you could also ship your component with your model. This requires wrapping it as a Python package using the spacy package command, which creates all required Python files. By default, the __init__.py only includes a function to load your model – but you can also add custom functions to it or use it to add entries to spaCy's factories.

As of v2.1.0 (currently available as a nightly version for testing), spaCy will also support providing pipeline component factories via Python entry points. This is especially useful for production setups and/or if you want to modularise your individual components and split them into their own packages. For example, you could create a Python package for your countries component and its factory, upload it to PyPI, version it and test it separately. In its setup.py, your package can define the spaCy factories it exposes and where to find them. spaCy will be able to detect them automatically: all you need to do is install the package in the same environment. Your model package could even require your component package as a dependency so it's installed automatically when you install your model.
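As a rough illustration of how entry-point discovery works in general, the sketch below uses only the standard library's importlib.metadata. It is not spaCy's internal code. The group name "spacy_factories" is the one spaCy's documentation uses for v2.1; my_countries, create and discover_factories are hypothetical names.

```python
from importlib import metadata  # available in Python 3.8+

# A component package would declare its factory in setup.py, roughly:
#   entry_points={"spacy_factories": ["countries = my_countries:create"]}
# (my_countries and create are hypothetical names.)


def discover_factories(group="spacy_factories"):
    """Return {name: entry point} for everything installed under `group`."""
    eps = metadata.entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ exposes a select() method for filtering by group.
        selected = eps.select(group=group)
    else:
        # Python 3.8 and 3.9 return a plain dict keyed by group name.
        selected = eps.get(group, [])
    return {ep.name: ep for ep in selected}


found = discover_factories()
print(sorted(found))
# Calling found[name].load() would import and return the factory function.
```

If no installed package declares anything in that group, the result is simply an empty dict, which matches the point that installing the package is all that makes the factory available.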
