How do I find the shortest overlapping match using regular expressions?
I'm still relatively new to regex. I'm trying to find the shortest string of text that matches a particular pattern, but am having trouble if the shortest pattern is a substring of a larger match. For example:
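import re
string = "A|B|A|B|C|D|E|F|G"
# Lazy quantifiers: each .*? takes as few characters as possible
my_pattern = 'a.*?b.*?c'
my_regex = re.compile(my_pattern, re.DOTALL|re.IGNORECASE)
# findall() only returns non-overlapping matches
matches = my_regex.findall(string)
for match in matches:
    print(match)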
prints:
A|B|A|B|C
but I'd want it to return:
A|B|C
Is there a way to do this without having to loop over each match to see if it contains a substring that matches?
Contrary to most other answers here, this can be done in a single regex using a positive lookahead assertion with a capturing group:
>>> my_pattern = '(?=(a.*?b.*?c))'
>>> my_regex = re.compile(my_pattern, re.DOTALL|re.IGNORECASE)
>>> matches = my_regex.findall(string)
>>> print(min(matches, key=len))
A|B|C
findall() returns the captured text for every position at which the lookahead succeeds, including matches that overlap, so you need min() with key=len to pick the shortest one.
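If you need this in more than one place, it could be wrapped in a small helper. The following is only a sketch: the function name `shortest_overlapping_match` is made up for illustration and is not part of the `re` module. It uses finditer() with group(1) instead of findall(), so that a pattern containing its own groups still yields plain strings, and it returns None when nothing matches, because min() raises ValueError on an empty list.

```python
import re

def shortest_overlapping_match(pattern, text, flags=0):
    """Hypothetical helper: shortest match of `pattern` in `text`,
    considering overlapping matches; None if there is no match."""
    # Wrap the pattern in a lookahead with one outer capturing group.
    lookahead = re.compile('(?=(' + pattern + '))', flags)
    # group(1) is always the outer group, even if `pattern` has groups.
    candidates = [m.group(1) for m in lookahead.finditer(text)]
    return min(candidates, key=len) if candidates else None

print(shortest_overlapping_match('a.*?b.*?c', "A|B|A|B|C|D|E|F|G",
                                 re.DOTALL | re.IGNORECASE))
```

If several matches share the shortest length, min() returns the first of them, i.e. the leftmost one.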
How this works:
- We're not matching any text in this regex, just positions in the string (which the regex engine steps through during a match attempt).
- At each position, the regex engine looks ahead to see whether your regex would match at this position.
- If so, it will be captured by the capturing group.
- If not, it won't.
- In either case, the regex engine then steps ahead one character and repeats the process until the end of the string.
- Since the lookahead assertion doesn't consume any characters, all overlapping matches will be found.
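To see the steps above in action, here is a small trace for the example string from the question. It lists each position where the lookahead succeeds along with the captured text, and compares that with the plain pattern. The variable names `text`, `plain`, `lookahead` and `overlapping` are just for this illustration.

```python
import re

text = "A|B|A|B|C|D|E|F|G"
flags = re.DOTALL | re.IGNORECASE

# Plain pattern: the first match consumes "A|B|A|B|C", so the shorter
# match that starts inside it is never reported.
plain = re.findall('a.*?b.*?c', text, flags)
print(plain)

# Lookahead version: every match is zero-width, so the engine tries
# each position in turn and the capturing group records what it saw.
lookahead = re.compile('(?=(a.*?b.*?c))', flags)
overlapping = [(m.start(), m.group(1)) for m in lookahead.finditer(text)]
for start, captured in overlapping:
    print(start, captured)
# 0 A|B|A|B|C
# 4 A|B|C
```

Only positions 0 and 4 hold an "A", so those are the only two places where the lookahead can succeed.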