Using regex in spaCy: matching various (different cased) words


Problem description

(Edited because the original topic was incorrect.)

I want to use a regex in spaCy to find any of (Accrued, accrued, Annual, annual) followed by "leave", using this code:

import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_sm')

matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
# (spaCy v2 signature; in spaCy v3 use: matcher.add('LEAVE', [pattern]))
matcher.add('LEAVE', None,
            [{'TEXT': {"REGEX": "(Accrued|accrued|Annual|annual)"}},
             {'LOWER': 'leave'}])

# Call the matcher on the doc
doc = nlp('Annual leave shall be paid at the time . An employee is  to receive their annual leave payment in the normal pay cycle. Where an employee has accrued annual leave in')

matches = matcher(doc)

# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print('- ', matched_span.sent.text)

# returned:
- Annual leave shall be paid at the time .
- An employee is  to receive their annual leave payment in the normal pay cycle.
- Where an employee has accrued annual leave in

However, I think my regex is not abstract or general enough to be applied to other situations. I would very much appreciate your advice on how to improve my regex for use with spaCy.

Recommended answer

Your code is fine. The only problem was the typo ananual in the pattern; with it corrected to annual, your code yields all 3 sentences.

However, you do not need to repeat the differently cased words. With Python re regex syntax, you can put the (?i) inline modifier at the start of the pattern, and the whole pattern becomes case insensitive.

You can use

"(?i)accrued|annual"

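To show where such a pattern would go in the matcher from the question, here is a minimal sketch. It assumes spaCy v3 (where `Matcher.add` takes a list of patterns) and uses a blank English pipeline instead of `en_core_web_sm`, since the pattern only looks at token text. The example sentence is made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

# A blank English pipeline only tokenizes; that is enough for TEXT/LOWER
# patterns, so no trained model needs to be downloaded (assumption: spaCy v3).
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One case-insensitive regex replaces the four spelled-out alternatives.
pattern = [
    {"TEXT": {"REGEX": "(?i)accrued|annual"}},
    {"LOWER": "leave"},
]
matcher.add("LEAVE", [pattern])  # spaCy v3 signature

# Hypothetical example text with mixed casing.
doc = nlp("Annual leave is paid on time. ACCRUED leave and annual leave carry over.")

# Collect the text of every matched span.
matched = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched)
```

Each of the three differently cased phrases is picked up by the same pattern, without listing every casing by hand.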
Or, to match whole words, add word boundaries \b:

r"(?i)\b(?:accrued|annual)\b"

Note the r prefix before the opening ". It makes the string literal raw, so you do not have to escape \ in it: r"\b" == "\\b".
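The difference is easy to check with the standard library alone. The short sketch below (variable names are illustrative) shows that without the r prefix, `"\b"` is a single backspace character, so the pattern silently stops matching.

```python
import re

# In a normal string literal, "\b" is the backspace character (one character).
assert "\b" == "\x08"

# The raw form and the doubled-backslash form are the same two-character string.
assert r"\b" == "\\b"
assert len(r"\b") == 2

# With the raw string, \b reaches the regex engine as a word boundary.
raw_pattern = re.compile(r"(?i)\b(?:accrued|annual)\b")
has_raw_match = bool(raw_pattern.search("Annual"))

# Without the r prefix, the pattern looks for literal backspace characters.
plain_pattern = re.compile("(?i)\b(?:accrued|annual)\b")
has_plain_match = bool(plain_pattern.search("Annual"))

print(has_raw_match, has_plain_match)  # True False
```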

The (?:...) non-capturing group is there to make sure the \b word boundaries apply to all the alternatives inside the group. Without it, \baccrued|annual\b will match accruednesssss or biannual, for example (it matches words that start with accrued or words that end with annual).
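The effect of the group can be verified with plain `re`. This sketch applies both patterns with `re.search` to a few standalone words; the word list is just the examples from the explanation above.

```python
import re

# Without the group, each \b binds to only one alternative:
# "\baccrued" OR "annual\b".
ungrouped = re.compile(r"(?i)\baccrued|annual\b")

# With the group, both boundaries apply to every alternative.
grouped = re.compile(r"(?i)\b(?:accrued|annual)\b")

words = ["accrued", "Annual", "accruednesssss", "biannual"]

ungrouped_hits = [w for w in words if ungrouped.search(w)]
grouped_hits = [w for w in words if grouped.search(w)]

print(ungrouped_hits)  # ['accrued', 'Annual', 'accruednesssss', 'biannual']
print(grouped_hits)    # ['accrued', 'Annual']
```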
