python re.findall() with substring in alternations


Question

If I have a substring (or 'subpattern') of another string or pattern in a regex alternation, like so:

r'abcd|bc'

What is the expected behaviour of re.compile(r'abcd|bc').findall('abcd bcd bc ab')?

Trying it out, I get (as expected)

['abcd', 'bc', 'bc']

So I thought re.compile(r'bc|abcd').findall('abcd bcd bc ab') might yield ['bc', 'bc', 'bc'], but instead it again returns

['abcd', 'bc', 'bc']

Can someone explain this? I was under the impression that findall would greedily return matches, but apparently it backtracks and tries to match alternate patterns that would yield longer tokens.

Recommended answer

No backtracking takes place at all. Your pattern matches two different types of strings; | means or. Each pattern is tried out at each position.

So when the expression finds abcd at the start of your input, that text matches your pattern just fine: it fits the abcd part of the (bc or abcd) pattern you gave it.
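To see where each match begins, here is a minimal sketch that swaps findall for re.finditer so the start and end offsets are visible. The helper name spans is made up for illustration and is not part of the original question.

```python
import re

text = 'abcd bcd bc ab'

def spans(pattern, text):
    """Return (start, end, matched_text) for each non-overlapping match.

    `spans` is a hypothetical helper for this illustration only.
    """
    return [(m.start(), m.end(), m.group()) for m in re.finditer(pattern, text)]

first = spans(r'abcd|bc', text)
second = spans(r'bc|abcd', text)

print(first)   # [(0, 4, 'abcd'), (5, 7, 'bc'), (9, 11, 'bc')]
print(second)  # same list: swapping the alternatives changes nothing here
```

The scan moves left to right and the first match it finds starts at offset 0, where only abcd can match. After that match the search resumes at offset 4, so the bc inside abcd is never considered.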

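The ordering of the alternatives makes no difference here: as far as the regular expression engine is concerned, abcd|bc and bc|abcd produce the same matches for this input, because the two alternatives can never both match at the same starting position. The engine scans the input left to right, and at position 0 only abcd can match, since bc does not start with a. abcd is not disregarded just because bc might match later on in the string.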
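As a rough check of the ordering claim, the sketch below compares both orderings on the question's input. It then adds a contrasting pair, ab|abcd versus abcd|ab, which does not appear in the original question and is included only as an assumption-labelled illustration: when one alternative is a prefix of the other, both can match at the same position, and Python's re module takes the alternative listed first.

```python
import re

text = 'abcd bcd bc ab'

# The question's two orderings: 'abcd' and 'bc' can never match at the same
# starting position, so listing one before the other has no effect.
same = re.findall(r'abcd|bc', text) == re.findall(r'bc|abcd', text)

# Hypothetical contrast (not from the question): 'ab' is a prefix of 'abcd',
# so both alternatives can match at offset 0 and the first one listed wins.
short_first = re.findall(r'ab|abcd', text)
long_first = re.findall(r'abcd|ab', text)

print(same)         # True
print(short_first)  # ['ab', 'ab']
print(long_first)   # ['abcd', 'ab']
```

So the result in the question comes from where each match starts, not from the engine preferring longer tokens.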
