How to match abbreviations with their meaning with regex?

Problem description

I'm looking for a regex pattern that matches the following string:

Some example text (SET) that demonstrates what I'm looking for. Energy system models (ESM) are used to find specific optima (SCO). Some say computer systems (CUST) are cool. In the summer playing outside (OUTS) should be preferred.

My goal is to match the following:

Some example text (SET)
Energy system models (ESM)
specific optima (SCO)
computer systems (CUST)
outside (OUTS)

The important part is that it's not always exactly three words and their first letter. Sometimes the letters used for the abbreviation are merely contained in the preceding words. That's why I started looking into the positive lookbehind. However, it is constrained by length, which can be worked around by combining it with a positive lookahead. So far I couldn't come up with a robust solution though.
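
The length constraint on lookbehinds can be seen directly in Python's built-in `re` module. Python is an assumption here, since the question does not name a regex flavor. A minimal sketch: a fixed-width lookbehind compiles, while a variable-width one is rejected at compile time.

```python
import re

# Fixed-width lookbehind (exactly one "(" character): accepted by re.
fixed = re.compile(r"(?<=\()[A-Z]+")
print(fixed.findall("specific optima (SCO)"))

# Variable-width lookbehind (\w+ can have any length): re refuses to compile it.
try:
    re.compile(r"(?<=\w+ )\([A-Z]+\)")
    variable_ok = True
except re.error as exc:
    variable_ok = False
    print("rejected:", exc)
```

Other engines handle this differently, so the workaround needed depends on the flavor in use.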

What I have tried so far:

(\b[\w -]+?)\((([A-Z])(?<=(?=.*?\3))(?:[A-Z]){1,4})\)

This works reasonably well, but the matches include too many words:

Some example text (SET)
Energy system models (ESM)
are used to find specific optima (SCO)
Some say Computer systems (CUST)
In the summer playing outside (OUTS)

I have also tried to use a reference to the first letter of the abbreviation at the start of the first group. That didn't work at all though.

Things I have looked at but didn't find useful:

Useful resources:

Recommended answer

I suggest using

import re

def contains_abbrev(abbrev, text):
    # Accept only all-uppercase abbreviations such as "ESM"
    if not abbrev.isupper():
        return False
    text = text.lower()
    pos = 0
    for c in abbrev.lower():
        # Search after the previous hit so the letters must appear in order
        # and a repeated letter cannot reuse the same character
        pos = text.find(c, pos) + 1
        if pos == 0:
            return False
    return True

text = "Some example text (SET) that demonstrates what I'm looking for. Energy system models (ESM) are used to find specific optima (SCO). Some say computer systems (CUST) are cool. In the summer playing outside (OUTS) should be preferred. Stupid example(s) Stupid example(S) Not stupid example (NSEMPLE), bad example (Bexle)"
abbrev_rx = r'\b(([A-Z])\w*(?:\s+\w+)*?)\s*\((\2[A-Z]*)\)'
# Group 1 is the candidate phrase, Group 3 the abbreviation in parentheses
print([x.group() for x in re.finditer(abbrev_rx, text, re.I) if contains_abbrev(x.group(3), x.group(1))])

See the Python demo.

The regex used is

(?i)\b(([A-Z])\w*(?:\s+\w+)*?)\s*\((\2[A-Z]*)\)

See the regex demo. Details:

  • \b - word boundary
  • (([A-Z])\w*(?:\s+\w+)*?) - Group 1 (text): an ASCII letter captured into Group 2, then 0+ word chars followed by any 0 or more occurrences of 1+ whitespaces followed by 1+ word chars, as few as possible
  • \s* - 0+ whitespaces
  • \( - a ( char
  • (\2[A-Z]*) - Group 3 (abbrev): same value as in Group 2 and then 0 or more ASCII letters
  • \) - a ) char.
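
To make the group layout above concrete, here is a small sketch that prints Groups 1, 2 and 3 for each match. The shortened `sample` string and the variable names are illustrative assumptions; the pattern itself is the one from the answer.

```python
import re

# The pattern from the answer; re.I makes [A-Z] and the \2 backreference
# case-insensitive.
abbrev_rx = re.compile(r"\b(([A-Z])\w*(?:\s+\w+)*?)\s*\((\2[A-Z]*)\)", re.I)

# Shortened sample text (illustrative).
sample = "Energy system models (ESM) are used to find specific optima (SCO)."

# Collect (Group 1, Group 2, Group 3) for every match.
groups = [(m.group(1), m.group(2), m.group(3)) for m in abbrev_rx.finditer(sample)]
for phrase, first_letter, abbrev in groups:
    print(f"text={phrase!r} first_letter={first_letter!r} abbrev={abbrev!r}")
```

Because Group 3 must start with the letter captured in Group 2, a match can only begin at a word whose first letter equals the first letter of the abbreviation. This is what keeps "are used to find" out of the second match.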

Once there is a match, Group 3 is passed as abbrev and Group 1 as text to the contains_abbrev(abbrev, text) function, which makes sure that abbrev is an uppercase string and that all of its characters appear in text in the same order.
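
The ordering part of that check is a subsequence test. As an alternative sketch (not the answer's code), it can be written with a shared iterator; the name `is_subsequence` is hypothetical, and the uppercase check is deliberately left out here.

```python
def is_subsequence(abbrev: str, text: str) -> bool:
    """Return True if the letters of abbrev occur in text in the same order."""
    remaining = iter(text.lower())
    # `c in remaining` consumes the iterator up to and including the first hit,
    # so each abbreviation letter has to be found after the previous one.
    return all(c in remaining for c in abbrev.lower())

print(is_subsequence("CUST", "computer systems"))
print(is_subsequence("OUTS", "outside"))
print(is_subsequence("SCO", "optima specific"))
```

The last call uses a phrase that contains all three letters but in the wrong order, which is the case the ordering requirement is meant to reject.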
