Longest match only with spaCy PhraseMatcher


Question

I have created a spaCy PhraseMatcher to match names in a document, following the tutorial. I want to use the resulting matches as additional training data for a spaCy NER model. My patterns, however, contain both full names (e.g. 'Barack Obama') and last names ('Obama') as separate entries.

Hence, in a sentence that contains 'Barack Obama', both patterns match, producing overlapping matches. This overlap triggers an exception when I try to use the data for training, e.g.:

ValueError: [E103] Trying to set conflicting doc.ents: '(19, 33, 'PERSON')' and '(29, 33, 'PERSON')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.

I have considered filtering out overlapping matches before using the data for training, but this seems like a very inefficient approach that would significantly increase processing time on large data.

Is there a way to set up a PhraseMatcher so that, where matches overlap, it returns only the longest one?

Recommended answer

The PhraseMatcher has no built-in way to filter out overlapping matches while it is matching, but there is a utility function for filtering them afterwards: spacy.util.filter_spans(). It prefers the longest span; if two overlapping spans are the same length, it keeps the one that occurs earlier in the text.
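
As an illustration of that post-filtering step, here is a minimal sketch. It assumes spaCy v3 (where PhraseMatcher.add takes a list of pattern Docs) and uses a blank English pipeline so that no trained model has to be downloaded. The example sentence, the pattern list, and the "PERSON" match key are made up for the illustration and are not taken from the asker's data.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span
from spacy.util import filter_spans

# Blank pipeline: tokenizer only, which is all the PhraseMatcher needs here
nlp = spacy.blank("en")

# Hypothetical patterns: a full name and the bare last name, as in the question
names = ["Barack Obama", "Obama"]
matcher = PhraseMatcher(nlp.vocab)
matcher.add("PERSON", [nlp.make_doc(name) for name in names])  # spaCy v3 signature

doc = nlp("Yesterday Barack Obama gave a speech.")

# Turn every match into a labelled Span; this list still contains the overlap
spans = [Span(doc, start, end, label=match_id) for match_id, start, end in matcher(doc)]

# Keep only the longest span from each group of overlapping spans
longest = filter_spans(spans)

# With the overlap removed, assigning doc.ents no longer raises E103
doc.ents = longest
result = [(ent.text, ent.label_) for ent in doc.ents]
print(result)
```

If the training data is needed as character offsets, like those in the error message above, each entity's ent.start_char and ent.end_char can be read from doc.ents after this step. filter_spans only sorts and scans the matches found in a single document, so the extra cost is likely small next to the matching itself, though that is worth measuring on the real data.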
