Why does re.sub('.*?', '-', 'abc') return '-a-b-c-' instead of '-------'?
This is the result from Python 2.7:
>>> re.sub('.*?', '-', 'abc')
'-a-b-c-'
I expected the result to be as follows:
>>> re.sub('.*?', '-', 'abc')
'-------'
But it's not. Why?
The best explanation of this behaviour I know of is from the regex PyPI package, which is intended to eventually replace re (although it has been this way for a long time now).
Sometimes it’s not clear how zero-width matches should be handled. For example, should .* match 0 characters directly after matching >0 characters?
Most regex implementations follow the lead of Perl (PCRE), but the re module sometimes doesn’t. The Perl behaviour appears to be the most common (and the re module is sometimes definitely wrong), so in version 1 the regex module follows the Perl behaviour, whereas in version 0 it follows the legacy re behaviour.
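To make the two policies concrete, here is a toy model of the substitution loop for the single pattern '.*?'. It does not call any regex engine; the function name sub_lazy_any and its perl_style flag are made up for illustration, and it assumes the input contains no newlines (so that . can match every character).

```python
def sub_lazy_any(text, repl, perl_style):
    """Toy model of sub('.*?', repl, text); assumes text has no newlines."""
    out = []
    for ch in text:
        # The lazy '.*?' first succeeds with an empty match before each character.
        out.append(repl)
        if perl_style:
            # Perl-style: retry at the same position, this time requiring a
            # non-empty match, so '.*?' consumes one character and it is replaced too.
            out.append(repl)
        else:
            # Legacy re: step past the character, copying it through unchanged.
            out.append(ch)
    # One final empty match at the end of the string.
    out.append(repl)
    return "".join(out)


print(sub_lazy_any("abc", "-", perl_style=False))  # -a-b-c-
print(sub_lazy_any("abc", "-", perl_style=True))   # -------
```

With perl_style=False every character survives between two replacements, which is the '-a-b-c-' from the question. With perl_style=True each character is itself matched and replaced, giving 4 empty matches plus 3 one-character matches, hence 7 dashes.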
Examples:
# Version 0 behaviour (like re)
>>> regex.sub('(?V0).*', 'x', 'test')
'x'
>>> regex.sub('(?V0).*?', '|', 'test')
'|t|e|s|t|'

# Version 1 behaviour (like Perl)
>>> regex.sub('(?V1).*', 'x', 'test')
'xx'
>>> regex.sub('(?V1).*?', '|', 'test')
'|||||||||'
(?VX), i.e. (?V0) or (?V1), sets the version flag in the regex. The Version 1 output is what you expected, and is supposedly what PCRE does. Python's re is somewhat nonstandard, and is probably kept as it is solely for backwards compatibility. I've found an example of something similar (with re.split).
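Worth checking on your own interpreter: the question was asked against Python 2.7, but the re documentation notes that since Python 3.7 re.sub also replaces empty matches that are adjacent to a previous non-empty match, so newer versions should produce the Perl-style result. A small sketch that prints what the running interpreter does (the variable names are just for illustration):

```python
import re
import sys

# What the standard re module does on this interpreter.
result = re.sub('.*?', '-', 'abc')

# Since Python 3.7, re.sub replaces empty matches adjacent to a previous
# non-empty match, which yields the Perl-style output.
expected = '-------' if sys.version_info >= (3, 7) else '-a-b-c-'

# The individual match spans show where the empty matches occur.
spans = [m.span() for m in re.finditer('.*?', 'abc')]

print(sys.version_info[:2], repr(result), result == expected)
print(spans)
```

On 2.7 (and 3.x before 3.7) this is the '-a-b-c-' from the question; the regex package's (?V0)/(?V1) flag lets you choose either behaviour explicitly regardless of Python version.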