Spacy token-based matching with 'n' number of tokens between tokens

Question

I am using spaCy to match a particular expression in some text (in Italian). My text can appear in multiple forms, and I am trying to learn the best way to write a general rule. I have the 4 cases below, and I would like to write a general pattern that works for all of them. Something like:

import spacy
from spacy.matcher import Matcher

# case 1
text = 'Superfici principali e secondarie: 90 mq'
# case 2
# text = 'Superfici principali e secondarie di 90 mq'
# case 3
# text = 'Superfici principali e secondarie circa 90 mq'
# case 4
# text = 'Superfici principali e secondarie di circa 90 mq'

nlp = spacy.load('it_core_news_sm')
doc = nlp(text)

matcher = Matcher(nlp.vocab)

pattern = [
    {"LOWER": "superfici"},
    {"LOWER": "principali"},
    {"LOWER": "e"},
    {"LOWER": "secondarie"},
    # << some token here that allows max 3 tokens, or an IS_PUNCT, or nothing at all >>
    {"IS_DIGIT": True},
    {"LOWER": "mq"},
]

# spaCy v3 signature; in spaCy v2 this was matcher.add("Superficie", None, pattern)
matcher.add("Superficie", [pattern])

matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)

Recommended answer

You can add an optional {"IS_PUNCT": True, "OP": "?"} token, followed by three optional IS_ALPHA tokens:

pattern = [
            {"LOWER": "superfici"}, 
            {"LOWER": "principali"},
            {"LOWER": "e"},
            {"LOWER": "secondarie"},
            {"IS_PUNCT": True, "OP": "?"},
            {"IS_ALPHA": True, "OP": "?"},
            {"IS_ALPHA": True, "OP": "?"},
            {"IS_ALPHA": True, "OP": "?"},
            {"IS_DIGIT": True},
            {"LOWER": "mq"}
          ]
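A minimal sketch of how this pattern could be checked against all four cases from the question. It assumes spaCy v3, where `Matcher.add` takes a list of patterns, and it swaps `it_core_news_sm` for `spacy.blank("it")`, a tokenizer-only Italian pipeline, on the assumption that the lexical attributes used here need no trained model. The helper name `find_surfaces` is made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank Italian pipeline is enough, because LOWER, IS_PUNCT,
# IS_ALPHA and IS_DIGIT are lexical attributes set by the tokenizer.
nlp = spacy.blank("it")

pattern = [
    {"LOWER": "superfici"},
    {"LOWER": "principali"},
    {"LOWER": "e"},
    {"LOWER": "secondarie"},
    {"IS_PUNCT": True, "OP": "?"},  # optional ":" or other punctuation
    {"IS_ALPHA": True, "OP": "?"},  # up to three optional words
    {"IS_ALPHA": True, "OP": "?"},
    {"IS_ALPHA": True, "OP": "?"},
    {"IS_DIGIT": True},
    {"LOWER": "mq"},
]

matcher = Matcher(nlp.vocab)
matcher.add("Superficie", [pattern])  # spaCy v3 signature


def find_surfaces(text):
    """Return the distinct span texts matched in `text` (hypothetical helper)."""
    doc = nlp(text)
    return sorted({doc[start:end].text for _, start, end in matcher(doc)})


cases = [
    "Superfici principali e secondarie: 90 mq",
    "Superfici principali e secondarie di 90 mq",
    "Superfici principali e secondarie circa 90 mq",
    "Superfici principali e secondarie di circa 90 mq",
]
for case in cases:
    print(find_surfaces(case))
```

The set comprehension collapses any repeated reports of the same span, so each case prints at most one entry.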
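
"OP": "?" makes a token optional: it matches zero or one time, so each of these tokens may appear once or be absent. Taken together, the pattern allows at most one punctuation token followed by at most three alphabetic tokens before the number.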
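To see the zero-or-one behaviour in isolation, here is a small sketch with a single optional slot. It again assumes spaCy v3 and a tokenizer-only `spacy.blank("it")` pipeline; the rule name `OneOptionalWord` and the helper `has_match` are invented for this example.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: tokenizer-only pipeline, no trained model required.
nlp = spacy.blank("it")

matcher = Matcher(nlp.vocab)
# Exactly one optional alphabetic token between "secondarie" and the number.
matcher.add("OneOptionalWord", [[
    {"LOWER": "secondarie"},
    {"IS_ALPHA": True, "OP": "?"},
    {"IS_DIGIT": True},
]])


def has_match(text):
    """Return True if the rule matches anywhere in `text` (hypothetical helper)."""
    return len(matcher(nlp(text))) > 0


print(has_match("secondarie 90"))           # zero words in the optional slot
print(has_match("secondarie circa 90"))     # one word in the optional slot
print(has_match("secondarie di circa 90"))  # two words, but only one slot
```

Under these assumptions the first two calls should report a match and the third should not, because two words cannot fit into a single optional slot. This is why the answer repeats the optional IS_ALPHA token three times to allow up to three words.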
