How do I find the shortest overlapping match using regular expressions?


Question

I'm still relatively new to regex. I'm trying to find the shortest string of text that matches a particular pattern, but am having trouble if the shortest pattern is a substring of a larger match. For example:

import re

string = "A|B|A|B|C|D|E|F|G"
my_pattern = r'a.*?b.*?c'

# DOTALL lets "." match newlines; IGNORECASE lets "a" match "A".
my_regex = re.compile(my_pattern, re.DOTALL | re.IGNORECASE)
matches = my_regex.findall(string)

for match in matches:
    print(match)

prints:

A|B|A|B|C

but I'd want it to return:

A|B|C

Is there a way to do this without having to loop over each match to see if it contains a substring that matches?

Solution

Contrary to most other answers here, this can be done in a single regex using a positive lookahead assertion with a capturing group:

>>> my_pattern = r'(?=(a.*?b.*?c))'
>>> my_regex = re.compile(my_pattern, re.DOTALL | re.IGNORECASE)
>>> matches = my_regex.findall(string)
>>> print(min(matches, key=len))
A|B|C

findall() returns one captured match for every starting position at which the pattern matches (here both A|B|A|B|C and A|B|C), so you need min() with key=len to pick the shortest one.
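As a minimal sketch, the lookahead trick can be wrapped in a small helper. The name `shortest_overlapping_match` is hypothetical and not part of the `re` module. It uses `finditer()` and `group(1)` rather than `findall()` so that a pattern with capturing groups of its own still yields plain strings, and it returns `None` instead of letting `min()` raise `ValueError` when nothing matches.

```python
import re


def shortest_overlapping_match(pattern, text, flags=0):
    """Return the shortest match of `pattern` in `text`, allowing overlaps.

    Hypothetical helper for illustration; returns None if nothing matches.
    """
    # Wrap the pattern in a lookahead containing one outer capturing group,
    # so a match is attempted at every position without consuming text.
    lookahead = re.compile('(?=(' + pattern + '))', flags)
    # group(1) is always the outer group added above.
    candidates = [m.group(1) for m in lookahead.finditer(text)]
    # min() raises ValueError on an empty sequence, so guard against it.
    return min(candidates, key=len) if candidates else None


result = shortest_overlapping_match(
    r'a.*?b.*?c', "A|B|A|B|C|D|E|F|G", re.DOTALL | re.IGNORECASE
)
print(result)  # A|B|C
```

If several candidates share the minimum length, `min()` returns the first one it encounters, which is the leftmost.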

How this works:

  • We're not matching any text in this regex, just positions in the string (which the regex engine steps through during a match attempt).
  • At each position, the regex engine looks ahead to see whether your regex would match at this position.
  • If so, it will be captured by the capturing group.
  • If not, it won't.
  • In either case, the regex engine then steps ahead one character and repeats the process until the end of the string.
  • Since the lookahead assertion doesn't consume any characters, all overlapping matches will be found.
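The stepping described above can be sketched by hand. This is an illustration of the idea, not how the regex engine is implemented: it tries the inner pattern at every position with `Pattern.match(text, pos)` and compares the result with what the lookahead version reports. The variable names are assumptions made for this example.

```python
import re

text = "A|B|A|B|C|D|E|F|G"
flags = re.DOTALL | re.IGNORECASE
inner = re.compile(r'a.*?b.*?c', flags)

# Manual emulation: attempt a match starting exactly at each position.
# Nothing is consumed between attempts, so overlapping matches are kept.
manual = []
for pos in range(len(text) + 1):
    m = inner.match(text, pos)
    if m:
        manual.append((pos, m.group(0)))

# The lookahead version: each match is empty (a position only), and the
# text the lookahead saw is available in capturing group 1.
lookahead = re.compile(r'(?=(a.*?b.*?c))', flags)
via_regex = [(m.start(), m.group(1)) for m in lookahead.finditer(text)]

print(manual)     # [(0, 'A|B|A|B|C'), (4, 'A|B|C')]
print(via_regex)  # [(0, 'A|B|A|B|C'), (4, 'A|B|C')]
```

Both approaches find a match starting at index 0 and another starting at index 4; a plain `findall()` without the lookahead would consume the first match and never see the second.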
