Train spaCy's existing POS tagger with my own training examples


Question


I am trying to train the existing POS tagger on my own lexicon, not starting from scratch (I do not want to create an "empty model"). spaCy's documentation says "Load the model you want to start with", and the next step is "Add the tag map to the tagger using the add_label method". However, when I try to load the English small model and add the tag map, it throws this error:


ValueError: [T003] Resizing pre-trained Tagger models is not currently supported.

I would like to know how to fix this.


I have also seen Implementing custom POS Tagger in Spacy over existing english model : NLP - Python, but it suggests creating an "empty model", which is not what I want.


Also, spaCy's documentation is not very clear on whether we need a mapping dictionary (TAG_MAP) even if the tags in our training examples are the same as the universal dependency tags. Any thoughts?

from __future__ import unicode_literals, print_function
import plac
import random
from pathlib import Path
import spacy
from spacy.util import minibatch, compounding

TAG_MAP = {"noun": {"pos": "NOUN"}, "verb": {"pos": "VERB"}, "adj": {"pos": "ADJ"}, "adv": {"pos": "ADV"}}

TRAIN_DATA = [
    ('Afrotropical', {'tags': ['adj']}), ('Afrocentricity', {'tags': ['noun']}),
    ('Afrocentric', {'tags': ['adj']}), ('Afrocentrism', {'tags': ['noun']}),
    ('Anglomania', {'tags': ['noun']}), ('Anglocentric', {'tags': ['adj']}),
    ('apraxic', {'tags': ['adj']}), ('aglycosuric', {'tags': ['adj']}),
    ('asecretory', {'tags': ['adj']}), ('aleukaemic', {'tags': ['adj']}),
    ('agrin', {'tags': ['adj']}), ('Eurotransplant', {'tags': ['noun']}),
    ('Euromarket', {'tags': ['noun']}), ('Eurocentrism', {'tags': ['noun']}),
    ('adendritic', {'tags': ['adj']}), ('asynaptic', {'tags': ['adj']}),
    ('Asynapsis', {'tags': ['noun']}), ('ametabolic', {'tags': ['adj']})
]


@plac.annotations(
    lang=("ISO Code of language to use", "option", "l", str),
    output_dir=("Optional output directory", "option", "o", Path),
    n_iter=("Number of training iterations", "option", "n", int),
)
def main(lang="en", output_dir=None, n_iter=25):
    nlp = spacy.load('en_core_web_sm', disable=['ner', 'parser'])
    tagger = nlp.get_pipe('tagger')
    for tag, values in TAG_MAP.items():
        # this call raises ValueError [T003] on the pre-trained model
        tagger.add_label(tag, values)
    nlp.vocab.vectors.name = 'spacy_pretrained_vectors'
    optimizer = nlp.begin_training()
    for i in range(n_iter):
        random.shuffle(TRAIN_DATA)
        losses = {}
        # batch up the examples using spaCy's minibatch
        batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, losses=losses)
        print("Losses", losses)

    # test the trained model
    test_text = "I like Afrotropical apraxic blue eggs and Afrocentricity. A Eurotransplant is cool too. The agnathostomatous Euromarket and asypnapsis is even cooler. What about Eurocentrism?"
    doc = nlp(test_text)
    print("Tags", [(t.text, t.tag_, t.pos_) for t in doc])

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        doc = nlp2(test_text)
        print("Tags", [(t.text, t.tag_, t.pos_) for t in doc])


if __name__ == "__main__":
    plac.call(main)

Answer


The English model is trained on PTB tags, not UD tags. spaCy's tag map gives you a pretty good idea of the correspondences, but the PTB tagset is more fine-grained than the UD tagset:

https://github.com/explosion/spaCy/blob/master/spacy/lang/en/tag_map.py


Skip the tag_map-related code (the PTB -> UD mapping is already there in the model), change your tags in your data to PTB tags (NN, NNS, JJ, etc.), and then this script should run. (You'll still have to check whether it performs well, of course.)
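A minimal sketch of that change, assuming a simple custom-label-to-PTB mapping (NN for nouns, VB for verbs, JJ for adjectives, RB for adverbs; these are illustrative defaults, and plural nouns would need NNS, and so on):

```python
# Sketch: rewrite the training data to use PTB tags the English model
# already knows, so the tag_map / add_label code can be skipped.
# The mapping below is an assumption; pick the PTB tag that actually
# fits each word (e.g. NNS for plural nouns).
PTB_FOR_CUSTOM = {"noun": "NN", "verb": "VB", "adj": "JJ", "adv": "RB"}

def to_ptb(train_data):
    """Replace each example's custom tags with their PTB equivalents."""
    return [
        (text, {"tags": [PTB_FOR_CUSTOM[t] for t in ann["tags"]]})
        for text, ann in train_data
    ]

TRAIN_DATA = [
    ("Afrotropical", {"tags": ["adj"]}),
    ("Afrocentricity", {"tags": ["noun"]}),
]
print(to_ptb(TRAIN_DATA))
# [('Afrotropical', {'tags': ['JJ']}), ('Afrocentricity', {'tags': ['NN']})]
```

With the data in this form, the rest of the original script runs unchanged once the TAG_MAP loop is deleted: nlp.update still receives the same (text, {'tags': [...]}) pairs.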


In general, it's better to provide training examples with full phrases or sentences, since that's what spaCy will be tagging in real usage, like your test sentence.
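For instance, the single-word entries could become hypothetical sentence-level examples; the PTB tags below are illustrative, and the one constraint is that each token gets exactly one tag (spaCy's tokenization must line up with the tag list):

```python
# Hypothetical full-sentence training examples in the same
# (text, {"tags": [...]}) format: one PTB tag per token.
TRAIN_DATA = [
    ("I like Afrotropical eggs", {"tags": ["PRP", "VBP", "JJ", "NNS"]}),
    ("Eurocentrism is even cooler", {"tags": ["NN", "VBZ", "RB", "JJR"]}),
]

# Sanity check: the number of tags must match the number of tokens.
for text, ann in TRAIN_DATA:
    assert len(text.split()) == len(ann["tags"]), text
print("all examples aligned")
```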

