Using PhraseMatcher in spaCy to find multiple match types


Question

The spaCy documentation and examples show that the PhraseMatcher class is useful for matching sequences of tokens in documents. One must provide a vocabulary of the sequences to be matched.

In my application, the documents are collections of tokens and phrases, and they contain entities of different types. The data only remotely resembles natural language (the documents are more like sets of keywords in semi-random order). I am trying to find matches of multiple types.

For example:

yellow boots for kids

How can I find the matches for colors (e.g. yellow), for product types (e.g. boots) and for ages (e.g. kids) using spaCy's PhraseMatcher? Is this a good use case? If matches for different entities overlap (e.g. a value appears in both the colors list and the materials list), is it possible to produce all the distinct matches?

I cannot really use a sequence tagger, as the data is loosely structured and riddled with ambiguities. I have a list of entities (e.g. colors, ages, product types) and a list of values associated with each.

One idea would be to instantiate multiple PhraseMatcher objects, one for each entity, run the matches separately and then merge the results. Each entity type would get its own vocabulary. This sounds straightforward but may be inefficient, especially the merging step, because the value lists are fairly large. Before going down this route, I would like to know whether it is a good idea, or whether there is a simpler way to do this with spaCy.

Recommended answer

spaCy's PhraseMatcher supports adding multiple rules, each containing several patterns, and assigning an ID to each rule you add. If two rules overlap, both matches will be returned. So you could do something like this:

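import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank('en')  # assumption: a blank pipeline is enough for exact-text matching

# One list of pattern Docs per entity type
color_patterns = [nlp(text) for text in ('red', 'green', 'yellow')]
product_patterns = [nlp(text) for text in ('boots', 'coats', 'bag')]
material_patterns = [nlp(text) for text in ('silk', 'yellow fabric')]

matcher = PhraseMatcher(nlp.vocab)
# spaCy v2 signature: add(key, on_match, *patterns).
# In spaCy v3, pass the patterns as a list instead: matcher.add('COLOR', color_patterns)
matcher.add('COLOR', None, *color_patterns)
matcher.add('PRODUCT', None, *product_patterns)
matcher.add('MATERIAL', None, *material_patterns)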

When you call the matcher on your doc, spaCy will return a list of (match_id, start, end) tuples. Because spaCy stores all strings as integers, the match_id you get back will be an integer too, but you can always get the string representation by looking it up in the vocabulary's StringStore, i.e. nlp.vocab.strings:

doc = nlp("yellow fabric")
matches = matcher(doc)
for match_id, start, end in matches:
    rule_id = nlp.vocab.strings[match_id]  # get the unicode ID, i.e. 'COLOR'
    span = doc[start : end]  # get the matched slice of the doc
    print(rule_id, span.text)

# COLOR yellow
# MATERIAL yellow fabric
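
To tie this back to the example in the question, here is a minimal, self-contained sketch that registers one rule per entity type and groups the matched text by rule ID. It assumes spaCy v3 (where patterns are passed to add as a list); the value lists, the AGE rule and the find_entities helper are made up for illustration and are not part of spaCy.

```python
from collections import defaultdict

import spacy
from spacy.matcher import PhraseMatcher

# A blank English pipeline only tokenizes, which is all exact-text matching needs.
nlp = spacy.blank("en")

# Hypothetical value lists, one per entity type.
value_lists = {
    "COLOR": ["red", "green", "yellow"],
    "PRODUCT": ["boots", "coats", "bag"],
    "AGE": ["kids", "adults"],
    "MATERIAL": ["silk", "yellow fabric"],
}

matcher = PhraseMatcher(nlp.vocab)
for label, values in value_lists.items():
    # spaCy v3 signature: add(key, list_of_pattern_docs)
    matcher.add(label, [nlp.make_doc(value) for value in values])


def find_entities(text):
    """Return a dict mapping each rule ID to the list of matched strings."""
    doc = nlp(text)
    found = defaultdict(list)
    for match_id, start, end in matcher(doc):
        label = nlp.vocab.strings[match_id]  # integer ID back to 'COLOR', etc.
        found[label].append(doc[start:end].text)
    return dict(found)


result = find_entities("yellow boots for kids")
overlapping = find_entities("yellow fabric")
print(result)
print(overlapping)
```

A single matcher holds all rules, so there is nothing to merge afterwards: overlapping values simply show up under each rule ID they belong to.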

When you add matcher rules, you can also define an on_match callback function. In the spaCy v2 API it is the second argument of the matcher's add method (the None in the calls above); in spaCy v3 it is the on_match keyword argument. This is often useful if you want to trigger specific actions: for example, do one thing if a COLOR match is found, and something else for a PRODUCT match.
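
As a rough illustration of the callback idea, the sketch below attaches a different on_match function to each rule. It assumes spaCy v3, where the callback is passed as the on_match keyword argument and is called with (matcher, doc, i, matches); the seen dictionary and the two callback names are hypothetical.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # tokenizer-only pipeline

# Hypothetical store for whatever each callback decides to record.
seen = {"COLOR": [], "PRODUCT": []}


def on_color(matcher, doc, i, matches):
    # matches[i] is the (match_id, start, end) tuple that triggered this call
    _, start, end = matches[i]
    seen["COLOR"].append(doc[start:end].text)


def on_product(matcher, doc, i, matches):
    _, start, end = matches[i]
    seen["PRODUCT"].append(doc[start:end].text)


matcher = PhraseMatcher(nlp.vocab)
matcher.add(
    "COLOR",
    [nlp.make_doc(text) for text in ("red", "green", "yellow")],
    on_match=on_color,
)
matcher.add(
    "PRODUCT",
    [nlp.make_doc(text) for text in ("boots", "coats", "bag")],
    on_match=on_product,
)

# Calling the matcher runs the callbacks as a side effect.
matches = matcher(nlp("yellow boots for kids"))
print(seen)
```

The callbacks fire once per match, so the per-entity logic stays next to the rule it belongs to instead of living in one large loop over the results.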

If you want to solve this even more elegantly, you might also want to look into combining your matcher with a custom pipeline component or custom attributes. For example, you could write a simple component that runs automatically when you call nlp() on your text, finds the matches, and sets a Doc._.contains_product or Token._.is_color attribute. The docs have a few examples of this that should help you get started.
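
One possible shape for such a component is sketched below. It assumes spaCy v3 (the Language.component decorator and nlp.add_pipe with a string name); the component name product_flagger and the two value lists are made up for illustration, while the attribute names follow the ones mentioned above.

```python
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc, Token

nlp = spacy.blank("en")  # tokenizer-only pipeline

# Register the custom attributes; force=True avoids an error if this runs twice.
Doc.set_extension("contains_product", default=False, force=True)
Token.set_extension("is_color", default=False, force=True)

matcher = PhraseMatcher(nlp.vocab)
matcher.add("COLOR", [nlp.make_doc(t) for t in ("red", "green", "yellow")])
matcher.add("PRODUCT", [nlp.make_doc(t) for t in ("boots", "coats", "bag")])


@Language.component("product_flagger")  # hypothetical component name
def product_flagger(doc):
    """Run the matcher and record the results in custom attributes."""
    for match_id, start, end in matcher(doc):
        label = doc.vocab.strings[match_id]
        if label == "PRODUCT":
            doc._.contains_product = True
        elif label == "COLOR":
            for token in doc[start:end]:
                token._.is_color = True
    return doc


nlp.add_pipe("product_flagger")

doc = nlp("yellow boots for kids")
color_tokens = [token.text for token in doc if token._.is_color]
print(doc._.contains_product, color_tokens)
```

With the component in the pipeline, downstream code only needs to read the attributes and never touches the matcher directly.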
