Impossible lookbehind with a backreference

Question

From my understanding,

(.)(?<!\1)

should never match. Actually, PHP's preg_replace even refuses to compile this, and so does Ruby's gsub. The Python re module seems to have a different opinion, though:

import re
test = 'xAAAAAyBBBBz'
print (re.sub(r'(.)(?<!\1)', r'(\g<0>)', test))

Result:

(x)AAAA(A)(y)BBB(B)(z)
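To make the expectation concrete, it helps to spell out what the lookbehind inspects. The sketch below is a hand-written simulation, not the engine's actual algorithm, and the function name `negative_lookbehind_hits` is made up for illustration. It also runs a negative lookahead on the same input, which happens to reproduce the string quoted above.

```python
import re

test = 'xAAAAAyBBBBz'

def negative_lookbehind_hits(text):
    # Simulate (.)(?<!\1) by hand: return the indexes where it would match.
    hits = []
    for i, captured in enumerate(text):
        end = i + 1                   # position right after group 1
        behind = text[end - 1:end]    # the single character the lookbehind inspects
        if behind != captured:        # (?<!\1) succeeds only if the two differ
            hits.append(i)
    return hits

# The character behind the current position is always group 1 itself,
# so no index can qualify.
print(negative_lookbehind_hits(test))

# For comparison: a negative *lookahead* on the same input.
lookahead_result = re.sub(r'(.)(?!\1)', r'(\g<0>)', test)
print(lookahead_result)
```

The second print gives `(x)AAAA(A)(y)BBB(B)(z)`, identical to the reported result. That is only an observation about the output, not a statement about how the engine arrives at it.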

Can anyone provide a reasonable explanation for this behavior?

Update

This behavior appears to be a limitation in the re module. The alternative regex module seems to handle groups in assertions correctly:

import regex

test = 'xAAAAAyBBBBz'

print (regex.sub(r'(.)(?<!\1)', r'(\g<0>)', test))
## xAAAAAyBBBBz

print (regex.sub(r'(.)(.)(?<!\1)', r'(\g<0>)', test))
## (xA)AAA(Ay)BBB(Bz)
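
If installing regex is not an option, the two-character case can probably be expressed without any backreference inside a lookbehind. `(.)(.)(?<!\1)` only asks that the second character differ from the first, so a negative lookahead placed before the second dot states the same condition. This is a sketch of a workaround using the standard re module, not something taken from the regex documentation.

```python
import re

test = 'xAAAAAyBBBBz'

# (.)(?!\1)(.) : capture one character, require that the next one
# is different, then consume that next character.
portable = re.sub(r'(.)(?!\1)(.)', r'(\g<0>)', test)
print(portable)
```

On this input the result is `(xA)AAA(Ay)BBB(Bz)`, the same string regex produces for the lookbehind form.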

Note that unlike PCRE, regex also allows variable-width lookbehinds:

print (regex.sub(r'(.)(?<![A-Z]+)', r'(\g<0>)', test))
## (x)AAAAA(y)BBBB(z)
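
For contrast, here is a sketch of how the standard re module treats the same idea. As far as I know, re rejects a variable-width lookbehind at compile time. In this particular pattern the `+` is not needed anyway: `(?<![A-Z]+)` fails exactly when the character just before the current position is an uppercase letter, so the fixed-width `(?<![A-Z])` should be equivalent. The helper name `compiles` is hypothetical.

```python
import re

test = 'xAAAAAyBBBBz'

def compiles(pattern):
    # True if re accepts the pattern, False if it raises re.error.
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Variable-width lookbehind: expected to be refused by re.
variable_width_ok = compiles(r'(.)(?<![A-Z]+)')
print(variable_width_ok)

# Fixed-width form of the same condition.
fixed = re.sub(r'(.)(?<![A-Z])', r'(\g<0>)', test)
print(fixed)
```

The fixed-width form yields `(x)AAAAA(y)BBBB(z)`, matching the regex output above.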

The regex module has been proposed for eventual inclusion in the standard library; PEP 411 mentions it as a candidate.

Solution

This does look like a limitation (nice way of saying "bug", as I learned from a support call with Microsoft) in the Python re module.

I guess it has to do with the fact that Python does not support variable-length lookbehind assertions, but it's not clever enough to figure out that \1 will always be fixed-length. Why it doesn't complain about this when compiling the regex, I can't say.

Funnily enough:

>>> print (re.sub(r'.(?<!\0)', r'(\g<0>)', test))
(x)(A)(A)(A)(A)(A)(y)(B)(B)(B)(B)(z)
>>>
>>> re.compile(r'(.*)(?<!\1)') # This should trigger an error but doesn't!
<_sre.SRE_Pattern object at 0x00000000026A89C0>
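
The first of these two results has a plausible reading. According to the re documentation, an escape whose first digit is 0 is an octal character escape, so `\0` is the NUL character rather than a reference to the whole match. Under that reading, `.(?<!\0)` simply matches every character that is not NUL. The sketch below checks this with an input that contains a NUL. It also reports whether `(.*)(?<!\1)` compiles instead of assuming either outcome, since that may depend on the Python version. The helper name `compile_status` is made up for illustration.

```python
import re

# \0 is the NUL character here, so only a NUL fails the negative lookbehind.
with_nul = re.sub(r'.(?<!\0)', r'(\g<0>)', 'a\x00b')
print(repr(with_nul))

def compile_status(pattern):
    # Return 'compiled' or the error message; the outcome may vary by version.
    try:
        re.compile(pattern)
        return 'compiled'
    except re.error as exc:
        return 'error: %s' % exc

print(compile_status(r'(.*)(?<!\1)'))
```

The re documentation lists support for fixed-length group references inside lookbehind as a change in Python 3.5, so recent interpreters may not reproduce the outputs quoted in this answer.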

So it is better not to use backreferences in lookbehind assertions in Python. Positive lookbehind isn't much better (it also matches here as if it were a positive lookahead):

>>> print (re.sub(r'(.)(?<=\1)', r'(\g<0>)', test))
x(A)(A)(A)(A)Ay(B)(B)(B)Bz
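
The comparison with a lookahead can be checked directly. This small sketch runs the lookahead form `(.)(?=\1)` and compares it with the output quoted above. The variable `reported` is just that quoted string, copied by hand.

```python
import re

test = 'xAAAAAyBBBBz'
reported = 'x(A)(A)(A)(A)Ay(B)(B)(B)Bz'   # output quoted in the answer

# Wrap every character that is followed by a copy of itself.
lookahead = re.sub(r'(.)(?=\1)', r'(\g<0>)', test)
print(lookahead)
print(lookahead == reported)
```

The two strings agree, which supports the description of the behaviour but does not explain its cause.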

And I can't even guess what's going on here:

>>> print (re.sub(r'(.+)(?<=\1)', r'(\g<0>)', test))
x(AA)(A)(A)Ay(BB)(B)Bz
