Impossible lookbehind with a backreference
From my understanding,
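(.)(?<!\1) should never match: the lookbehind checks the text just before the current position, which is the very character that group 1 has just captured. Actually, PHP's preg_replace even refuses to compile this, and so does Ruby's gsub. The Python re module seems to have a different opinion, though: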
import re
test = 'xAAAAAyBBBBz'
print(re.sub(r'(.)(?<!\1)', r'(\g<0>)', test))
Result:
(x)AAAA(A)(y)BBB(B)(z)
Can anyone provide a reasonable explanation for this behavior?
Update
This behavior appears to be a limitation in the re module. The alternative regex module seems to handle groups in assertions correctly:
import regex
test = 'xAAAAAyBBBBz'
print(regex.sub(r'(.)(?<!\1)', r'(\g<0>)', test))
## xAAAAAyBBBBz
print(regex.sub(r'(.)(.)(?<!\1)', r'(\g<0>)', test))
## (xA)AAA(Ay)BBB(Bz)
Note that, unlike PCRE, regex also allows variable-width lookbehinds:
print(regex.sub(r'(.)(?<![A-Z]+)', r'(\g<0>)', test))
## (x)AAAAA(y)BBBB(z)
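For this particular pattern the variable width is not actually needed, so the same output can be produced with the standard re module. A minimal sketch, assuming the goal is only to wrap characters that are not uppercase ASCII letters; the name wrap_non_upper is made up for illustration:

```python
import re

test = 'xAAAAAyBBBBz'

def wrap_non_upper(text):
    # (?<![A-Z]+) fails whenever some run of uppercase letters ends at the
    # current position, which happens exactly when the single preceding
    # character is uppercase. A one-character lookbehind is therefore enough.
    return re.sub(r'(.)(?<![A-Z])', r'(\g<0>)', text)

print(wrap_non_upper(test))
## (x)AAAAA(y)BBBB(z)
```

This shortcut only works because the lookbehind here is a repeated single character class; a lookbehind whose alternatives really differ in length still needs the regex module.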
PEP 411 mentions regex as a candidate for eventual inclusion in the standard library.
This does look like a limitation (a nice way of saying "bug", as I learned from a support call with Microsoft) in the Python re module.
I guess it has to do with the fact that Python does not support variable-length lookbehind assertions, but is not clever enough to figure out that \1 will always be fixed-length here. Why it doesn't complain about this when compiling the regex, I can't say.
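Since the outcome depends on the interpreter, a quick probe is one way to see what a given Python version does with these patterns. This is only a sketch: probe is a hypothetical helper name, and the loop makes no claim about what any particular version will print for the backreference patterns.

```python
import re

test = 'xAAAAAyBBBBz'

def probe(pattern, text=test):
    """Return the substitution result, or None if re refuses the pattern."""
    try:
        return re.sub(pattern, r'(\g<0>)', text)
    except re.error:
        # Raised when the pattern cannot be compiled, for example for a
        # lookbehind that re considers to be of variable width.
        return None

for pattern in (r'(.)(?<!\1)', r'(.*)(?<!\1)', r'(.)(?<![A-Z]+)'):
    print(pattern, '->', probe(pattern))
```

A None result means the pattern was rejected at compile time; any string result means it compiled, and whether that string is sensible has to be judged by hand.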
Funnily enough:
>>> print(re.sub(r'.(?<!\0)', r'(\g<0>)', test))
(x)(A)(A)(A)(A)(A)(y)(B)(B)(B)(B)(z)
>>>
>>> re.compile(r'(.*)(?<!\1)') # This should trigger an error but doesn't!
<_sre.SRE_Pattern object at 0x00000000026A89C0>
So it's better not to use backreferences in lookbehind assertions in Python. Positive lookbehind isn't much better (it also matches here, as if it were a positive lookahead):
>>> print(re.sub(r'(.)(?<=\1)', r'(\g<0>)', test))
x(A)(A)(A)(A)Ay(B)(B)(B)Bz
And I can't even guess what's going on here:
>>> print(re.sub(r'(.+)(?<=\1)', r'(\g<0>)', test))
x(AA)(A)(A)Ay(BB)(B)Bz