How to avoid double-extracting overlapping patterns in spaCy with Matcher?
Question
I need to extract combinations of items from two lists using the spaCy Matcher in Python. The problem is as follows. Suppose we have two lists:
colors=['red','bright red','black','brown','dark brown']
animals=['fox','bear','hare','squirrel','wolf']
I match the sequences with the following code:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")  # assumption: any English pipeline works here

first_color = []
last_color = []
only_first_color = []
for color in colors:
    if ' ' in color:
        # Two-word color: store each word separately
        first_color.append(color.split(' ')[0])
        last_color.append(color.split(' ')[1])
    else:
        only_first_color.append(color)

matcher = Matcher(nlp.vocab)
pattern1 = [{"TEXT": {"IN": only_first_color}}, {"TEXT": {"IN": animals}}]
pattern2 = [{"TEXT": {"IN": first_color}}, {"TEXT": {"IN": last_color}}, {"TEXT": {"IN": animals}}]
# spaCy v2 signature; in spaCy v3 use matcher.add("ANIMALS", [pattern1, pattern2])
matcher.add("ANIMALS", None, pattern1, pattern2)

doc = nlp('bright red fox met black wolf')
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(start, end, span.text)
It gives the output:
0 3 bright red fox
1 3 red fox
4 6 black wolf
How can I extract only 'bright red fox' and 'black wolf'? Should I change the pattern rules or post-process the matches?
Any ideas are much appreciated!
Recommended answer
You can post-process the matches with spacy.util.filter_spans. The spaCy documentation describes it as follows:
Filter a sequence of Span objects and remove duplicates or overlaps. Useful for creating named entities (where one token can only be part of one entity) or when merging spans with Retokenizer.merge. When spans overlap, the (first) longest span is preferred over shorter spans.
Python code:
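matches = matcher(doc)
# Turn each (match_id, start, end) tuple into a Span
spans = [doc[start:end] for _, start, end in matches]
# filter_spans keeps the longest span wherever spans overlap
for span in spacy.util.filter_spans(spans):
    print(span.start, span.end, span.text)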
Output:
0 3 bright red fox
4 6 black wolf
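To see why 'red fox' is dropped, the "longest span wins" rule can be sketched in plain Python on the raw (match_id, start, end) tuples. This is a simplified illustration of the idea, not spaCy's actual implementation; the function name filter_overlaps and the hard-coded match tuples are hypothetical and only mirror the output shown in the question.

```python
def filter_overlaps(matches):
    """Keep the longest match among overlapping ones; ties go to the earliest start.

    `matches` is a list of (match_id, start, end) tuples with `end` exclusive,
    the same shape the Matcher returns. Hypothetical helper for illustration.
    """
    # Longest first; for equal lengths, the match that starts earlier comes first.
    ordered = sorted(matches, key=lambda m: (m[2] - m[1], -m[1]), reverse=True)
    kept = []
    seen = set()  # token indices already claimed by a kept match
    for match_id, start, end in ordered:
        # Skip the match if any of its tokens is already claimed.
        if seen.isdisjoint(range(start, end)):
            kept.append((match_id, start, end))
            seen.update(range(start, end))
    # Restore document order.
    return sorted(kept, key=lambda m: m[1])


# Token offsets taken from the question's output; "ANIMALS" stands in for the match id.
matches = [("ANIMALS", 0, 3), ("ANIMALS", 1, 3), ("ANIMALS", 4, 6)]
filtered = filter_overlaps(matches)
print(filtered)
```

The match at tokens 1 to 3 shares tokens with the longer match at 0 to 3, so it is discarded, while the non-overlapping match at 4 to 6 survives. In real code, spacy.util.filter_spans is likely the better choice because it works directly on Span objects.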