Using PhraseMatcher in SpaCy to find multiple match types

Question

The SpaCy documentation and samples show that the PhraseMatcher class is useful to match sequences of tokens in documents. One must provide a vocabulary of sequences that will be matched.

In my application, I have documents that are collections of tokens and phrases. There are entities of different types. The data is only remotely natural language (the documents are rather sets of keywords in semi-random order). I am trying to find matches of multiple types.

For example:

yellow boots for kids

How can I find the matches for colors (e.g. yellow), for product types (e.g. boots) and for the age (e.g. kids) using SpaCy's PhraseMatcher? Is this a good use case? If the different entity matches overlap (e.g. a color is matched in both the colors list and the materials list), is it possible to produce all unique cases?

I cannot really use a sequence tagger, as the data is loosely structured and riddled with ambiguities. I have a list of entities (e.g. colors, ages, product types) and associated value lists.

One idea would be to instantiate multiple PhraseMatcher objects, one for each entity type, run the matches separately and then merge the results. Each entity type would get its own vocabulary. This sounds straightforward but may be inefficient, especially the merging part, because the value lists are fairly large. Before going this route, I would like to know whether this is a good idea or whether there are simpler ways to do that with SpaCy.

Recommended answer

spaCy's PhraseMatcher supports adding multiple rules containing several patterns, and assigning IDs to each matcher rule you add. If two rules overlap, both matches will be returned. So you could do something like this:

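import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank('en')  # any loaded pipeline works; only the tokenizer is needed here

color_patterns = [nlp(text) for text in ('red', 'green', 'yellow')]
product_patterns = [nlp(text) for text in ('boots', 'coats', 'bag')]
material_patterns = [nlp(text) for text in ('silk', 'yellow fabric')]

matcher = PhraseMatcher(nlp.vocab)
# spaCy v2 signature: add(key, on_match, *patterns).
# In spaCy v3, pass a list instead: matcher.add('COLOR', color_patterns)
matcher.add('COLOR', None, *color_patterns)
matcher.add('PRODUCT', None, *product_patterns)
matcher.add('MATERIAL', None, *material_patterns)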

When you call the matcher on your doc, spaCy will return a list of (match_id, start, end) tuples. Because spaCy stores all strings as integers, the match_id you get back will be an integer, too – but you can always get the string representation by looking it up in the vocabulary's StringStore, i.e. nlp.vocab.strings:

doc = nlp("yellow fabric")
matches = matcher(doc)
for match_id, start, end in matches:
    rule_id = nlp.vocab.strings[match_id]  # get the unicode ID, i.e. 'COLOR'
    span = doc[start : end]  # get the matched slice of the doc
    print(rule_id, span.text)

# COLOR yellow
# MATERIAL yellow fabric
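Applied to the example from the question, the same idea can be sketched as a single matcher that groups results by entity type. This is a minimal sketch, not a definitive implementation: it assumes spaCy v3 (where add takes a list of Doc objects), and the value lists, the AGE label and the find_entities helper are hypothetical names chosen for illustration.

```python
from collections import defaultdict

import spacy
from spacy.matcher import PhraseMatcher

# Assumption: spaCy v3. A blank English pipeline is enough here because
# PhraseMatcher only needs tokenized text for exact (ORTH) matching.
nlp = spacy.blank("en")

# Hypothetical value lists, one per entity type.
vocabularies = {
    "COLOR": ["red", "green", "yellow"],
    "PRODUCT": ["boots", "coats", "bag"],
    "AGE": ["kids", "adults"],
}

# One matcher holds all entity types; the rule ID tells them apart.
matcher = PhraseMatcher(nlp.vocab)
for label, terms in vocabularies.items():
    # nlp.pipe tokenizes the terms in a stream, which helps with large lists.
    matcher.add(label, list(nlp.pipe(terms)))


def find_entities(text):
    """Return a dict mapping each rule ID to the list of matched texts."""
    doc = nlp(text)
    found = defaultdict(list)
    for match_id, start, end in matcher(doc):
        found[nlp.vocab.strings[match_id]].append(doc[start:end].text)
    return dict(found)


result = find_entities("yellow boots for kids")
print(result)
```

Because every match carries its rule ID, no separate merging step is needed, which addresses the efficiency concern raised in the question.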

When you add matcher rules, you can also define an on_match callback function as the second argument of PhraseMatcher.add (in spaCy v3 it is passed as the on_match keyword argument instead). This is often useful if you want to trigger specific actions, for example doing one thing if a COLOR match is found and something else for a PRODUCT match.
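A callback per rule might look like the following. This is a hedged sketch assuming spaCy v3, where the callback is passed as the on_match keyword argument and receives (matcher, doc, i, matches); the seen list and the two handler functions are hypothetical names used only for illustration.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: spaCy v3 and a blank English pipeline (tokenizer only).
nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)

seen = []  # hypothetical sink that records which handler fired


def on_color(matcher, doc, i, matches):
    # matches[i] is the match that triggered this callback.
    match_id, start, end = matches[i]
    seen.append(("COLOR", doc[start:end].text))


def on_product(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    seen.append(("PRODUCT", doc[start:end].text))


# Each rule gets its own callback.
matcher.add("COLOR", [nlp("red"), nlp("yellow")], on_match=on_color)
matcher.add("PRODUCT", [nlp("boots")], on_match=on_product)

# Calling the matcher runs the callbacks and still returns the match list.
matches = matcher(nlp("yellow boots for kids"))
print(seen)
```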

If you want to solve this even more elegantly, you might also want to look into combining your matcher with a custom pipeline component or custom attributes. For example, you could write a simple component that's run automatically when you call nlp() on your text, finds the matches, and sets a Doc._.contains_product or Token._.is_color attribute. The docs have a few examples of this that should help you get started.
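One possible shape for such a component is sketched below. It assumes spaCy v3 (the Language.component decorator and nlp.add_pipe with a string name); the component name keyword_flags and the small value lists are hypothetical, and force=True is used only so the snippet can be re-run without an "extension already exists" error.

```python
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc, Token

# Assumption: spaCy v3 and a blank English pipeline (tokenizer only).
nlp = spacy.blank("en")

matcher = PhraseMatcher(nlp.vocab)
matcher.add("COLOR", [nlp("red"), nlp("yellow")])
matcher.add("PRODUCT", [nlp("boots"), nlp("coats")])

# Register the custom attributes mentioned in the answer.
Token.set_extension("is_color", default=False, force=True)
Doc.set_extension("contains_product", default=False, force=True)


@Language.component("keyword_flags")  # hypothetical component name
def keyword_flags(doc):
    """Run the matcher and record the results in custom attributes."""
    for match_id, start, end in matcher(doc):
        label = doc.vocab.strings[match_id]
        if label == "COLOR":
            for token in doc[start:end]:
                token._.is_color = True
        elif label == "PRODUCT":
            doc._.contains_product = True
    return doc


# From here on, the component runs automatically on every nlp() call.
nlp.add_pipe("keyword_flags")

doc = nlp("yellow boots for kids")
colors = [token.text for token in doc if token._.is_color]
print(colors, doc._.contains_product)
```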
