spaCy custom NER is not returning any entity

Problem description

I am trying to train a spaCy model to recognize a few custom named entities. The training data is given below; it mostly covers server models, dates in the FY (fiscal year) format, and HDD types:

TRAIN_DATA = [
    ('Send me the number of units shipped in FY21 for A566TY server', {'entities': [(39, 42, 'DateParse'), (48, 53, 'server')]}),
    ('Send me the number of units shipped in FY-21 for A5890Y server', {'entities': [(39, 43, 'DateParse'), (49, 53, 'server')]}),
    ('How many systems sold with 3.5 inch drives in FY20-Q2 for F567?', {'entities': [(46, 52, 'DateParse'), (58, 61, 'server'), (27, 29, 'HDD')]}),
    ('Total revenue in FY20Q2 for 3.5 HDD', {'entities': [(17, 22, 'DateParse'), (28, 30, 'HDD')]}),
    ('How many systems sold with 3.5 inch drives in FY20-Q2 for F567?', {'entities': [(46, 52, 'DateParse'), (58, 61, 'server'), (27, 29, 'HDD')]}),
    ('Total units shipped in FY2017-FY2021', {'entities': [(23, 28, 'DateParse'), (30, 35, 'DateParse')]}),
    ('Total units shipped in FY 18', {'entities': [(23, 27, 'DateParse')]}),
    ('Total units shipped between FY16 and FY2021', {'entities': [(28, 31, 'DateParse'), (37, 42, 'DateParse')]}),
]

import random

import spacy
from spacy.util import minibatch, compounding


def train_spacy(data, iterations):
    # Written against the spaCy v2 API (nlp.create_pipe, nlp.begin_training)
    TRAIN_DATA = data
    nlp = spacy.blank('en')  # create blank Language class
    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)

    # add labels
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        optimizer = nlp.begin_training()
        for itn in range(iterations):
            print("Starting iteration " + str(itn))
            random.shuffle(TRAIN_DATA)
            losses = {}
            # batch up the examples using spaCy's minibatch
            batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(
                    texts,  # batch of texts
                    annotations,  # batch of annotations
                    drop=0.2,  # dropout - make it harder to memorise data
                    losses=losses,
                )
            print("Losses", losses)
    return nlp

However, when I run the trained model, even on the training data itself, no entities are returned.

prdnlp = train_spacy(TRAIN_DATA, 100)
for text, _ in TRAIN_DATA:
    doc = prdnlp(text)
    print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
    print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])

The output is as follows:

Recommended answer

spaCy can currently only train from entity annotations that line up with token boundaries. The main problem is that your span end offsets are one character too short. The start/end character values should behave exactly like string slice indices into the text:

text = "Send me the number of units shipped in FY21 for A566TY server"
# (39, 42, 'DateParse')
assert text[39:42] == "FY2"

You should use (39, 43, 'DateParse') instead.
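
One way to avoid counting characters by hand is to derive the offsets from the substrings themselves. The sketch below is only an illustration: `span_for` and `show_spans` are hypothetical helper names, not part of spaCy, and the approach assumes each entity string can be located with a plain substring search.

```python
def span_for(text, phrase, label, start=0):
    """Return (start, end, label) for the first occurrence of phrase in text."""
    begin = text.index(phrase, start)  # raises ValueError if phrase is absent
    return (begin, begin + len(phrase), label)


def show_spans(text, entities):
    """Return the substring each (start, end, label) annotation actually covers."""
    return [(text[start:end], label) for start, end, label in entities]


text = "Send me the number of units shipped in FY21 for A566TY server"

# Offsets from the question: each end index stops one character early.
original = [(39, 42, "DateParse"), (48, 53, "server")]
print(show_spans(text, original))

# Offsets derived from the substrings themselves.
fixed = [span_for(text, "FY21", "DateParse"), span_for(text, "A566TY", "server")]
print(fixed)
print(show_spans(text, fixed))
```

Running `show_spans` over every item in TRAIN_DATA before training makes truncated spans such as "FY2" easy to spot.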

A secondary problem is that you may also need to adjust the tokenizer for cases like FY2017-FY2021: the default English tokenizer treats this as a single token, so the annotations [(23, 28, 'DateParse'), (30, 35, 'DateParse')] would be ignored during training.
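
A possible way to adjust the tokenizer is to add an extra infix pattern so that a hyphen between a digit and a letter becomes its own token. This is a minimal sketch, not the answer author's code: the regular expression `extra_infix` is an assumption chosen for this data, and the offsets (23, 29) and (30, 36) are the slice-style values for this sentence. `Doc.char_span` returns `None` when offsets do not align with token boundaries, which makes it a convenient check.

```python
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")

# Assumed extra rule: split on a hyphen that follows a digit and precedes a letter.
extra_infix = r"(?<=[0-9])-(?=[A-Za-z])"
infixes = list(nlp.Defaults.infixes) + [extra_infix]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

text = "Total units shipped in FY2017-FY2021"
doc = nlp(text)
tokens = [t.text for t in doc]
print(tokens)

# char_span returns None when the offsets do not line up with token boundaries.
first = doc.char_span(23, 29, label="DateParse")
second = doc.char_span(30, 36, label="DateParse")
print(first, second)
```

Note that this rule would also split FY20-Q2 into three tokens; an annotation covering the whole string still aligns, because it starts and ends on token boundaries.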

See a more detailed explanation here: https://github.com/explosion/spaCy/issues/4946#issuecomment-580663925
