spaCy custom NER is not returning any entity
Question
I am trying to train a spaCy model to recognize a few custom entity types. The training data is given below; it mostly covers server models, dates in the FY format, and types of HDD:
TRAIN_DATA = [('Send me the number of units shipped in FY21 for A566TY server', {'entities': [(39, 42, 'DateParse'),(48,53,'server')]}),
('Send me the number of units shipped in FY-21 for A5890Y server', {'entities': [(39, 43, 'DateParse'),(49,53,'server')]}),
('How many systems sold with 3.5 inch drives in FY20-Q2 for F567?', {'entities': [(46, 52, 'DateParse'),(58,61,'server'),(27,29,'HDD')]}),
('Total revenue in FY20Q2 for 3.5 HDD', {'entities': [(17, 22, 'DateParse'),(28,30,'HDD')]}),
('How many systems sold with 3.5 inch drives in FY20-Q2 for F567?', {'entities': [(46, 52, 'DateParse'),(58,61,'server'),(27,29,'HDD')]}),
('Total units shipped in FY2017-FY2021', {'entities': [(23, 28, 'DateParse'),(30,35,'DateParse')]}),
('Total units shipped in FY 18', {'entities': [(23, 27, 'DateParse')]}),
('Total units shipped between FY16 and FY2021', {'entities': [(28, 31, 'DateParse'),(37,42,'DateParse')]})
]
import random
import spacy
from spacy.util import minibatch, compounding

def train_spacy(data, iterations):
    TRAIN_DATA = data
    nlp = spacy.blank('en')  # create blank Language class
    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
    # add labels
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])
    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        optimizer = nlp.begin_training()
        for itn in range(iterations):
            print("Starting iteration " + str(itn))
            random.shuffle(TRAIN_DATA)
            losses = {}
            # batch up the examples using spaCy's minibatch
            batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(
                    texts,  # batch of texts
                    annotations,  # batch of annotations
                    drop=0.2,  # dropout - make it harder to memorise data
                    losses=losses,
                )
            print("Losses", losses)
    return nlp
But when I run the trained model, even on the training data itself, no entities are returned.
prdnlp = train_spacy(TRAIN_DATA, 100)
for text, _ in TRAIN_DATA:
    doc = prdnlp(text)
    print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
    print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])
The output is as follows:
Recommended answer
spaCy can currently only train from entity annotations that line up with token boundaries. The main problem is that your span end offsets are one character too short. The character start/end values should work exactly like string slice indices on the text, so the end offset is exclusive:
text = "Send me the number of units shipped in FY21 for A566TY server"
# (39, 42, 'DateParse')
assert text[39:42] == "FY2"
You should use (39, 43, 'DateParse') instead.
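The slice rule above can be checked without spaCy at all. The following is a minimal sketch in plain Python; the helper names `check_spans` and `find_span` are hypothetical and not part of spaCy, and `find_span` assumes the phrase occurs only once in the text.

```python
def check_spans(text, entities):
    """Return the substring that each (start, end, label) annotation selects."""
    return [(text[start:end], label) for start, end, label in entities]


def find_span(text, phrase, label):
    """Build (start, end, label) so that text[start:end] == phrase.

    Uses the first occurrence of `phrase` (assumption: it appears only once).
    """
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in {text!r}")
    return (start, start + len(phrase), label)


text = "Send me the number of units shipped in FY21 for A566TY server"

# Annotations as given in the question: both end offsets are one too short.
original = [(39, 42, "DateParse"), (48, 53, "server")]
print(check_spans(text, original))

# Offsets derived from the text itself, with an exclusive end index.
fixed = [
    find_span(text, "FY21", "DateParse"),
    find_span(text, "A566TY", "server"),
]
print(fixed)
print(check_spans(text, fixed))
```

Running a check like `check_spans` over every training example before training makes truncated spans such as "FY2" or "A566T" easy to spot.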
A secondary problem is that you may also need to adjust the tokenizer for cases like FY2017-FY2021, because the default English tokenizer treats this as one token, so the annotations [(23, 28, 'DateParse'), (30, 35, 'DateParse')] would be ignored during training.
See a more detailed explanation here: https://github.com/explosion/spaCy/issues/4946#issuecomment-580663925