Inconsistency between sed and Python regular expressions

Problem description

I apologize if this is published somewhere, but my cursory searching didn't find anything.

While doing some Python programming I noticed that the following command:

re.sub("a*((ab)*)b", r"\1", "aabb")

returns the empty string. But an equivalent command in sed:

echo "aabb" | sed "s/a*\(\(ab\)*\)b/\1/"

returns ab.

It makes sense to me that the "a*" at the beginning of the Python regex would match both a's, causing "(ab)*" to match zero times, but I have no idea how sed comes up with ab. Does anybody know what difference between the two regex engines causes this? I believe they both match stars greedily by default, but it occurred to me that sed might match from the right rather than the left. Any insight would be greatly appreciated.

Solution

Interesting puzzle you've constructed. From what I've read, the regexp engines of both Python and sed are based on Henry Spencer's regex library (as is Perl's), which relies on backtracking. (Unfortunately I can't find the article I'm basing this on, so treat this lineage as unverified.)

Anyway, this is not something that's supposed to be an implementation detail: Python's behavior goes against the POSIX standard, which requires REs to (a) match at the earliest possible point, and (b) match the longest possible string that starts at that point. (See man 7 regex (on Linux) for this and a whole lot more.)

To find the longest match, a backtracking ("NFA-type") regex engine must continue examining alternatives after it finds one match. So it's not surprising that the implementers cut corners. Python's behavior is non-conforming in this sense, since it stops at the first match its backtracking finds (aab) rather than the longest one (aabb). According to the sed manual page, sed doesn't always conform either, "for performance reasons". But it gets this case right.
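To make the difference concrete, here is a minimal Python sketch that emulates the leftmost-longest rule on top of the re module. The helper leftmost_longest is hypothetical (it is not part of re or sed) and uses brute force: at each start position it tries the longest candidate end first. It only guarantees the overall match is the longest; how POSIX assigns subexpressions in ambiguous cases is not modelled, and anchors may behave differently when endpos is used.

```python
import re

def leftmost_longest(pattern, text):
    """Hypothetical helper: earliest match, longest one at that start."""
    compiled = re.compile(pattern)
    for start in range(len(text) + 1):
        # Try the longest candidate end first, shrinking until one fits.
        for end in range(len(text), start - 1, -1):
            m = compiled.fullmatch(text, start, end)
            if m:
                return m
    return None

pattern = r"a*((ab)*)b"

# Python's own search stops at the first match its backtracking finds.
first = re.search(pattern, "aabb")
print(first.span(), repr(first.group(1)))      # (0, 3) ''

# Forcing the longest match at the earliest position gives sed's result.
longest = leftmost_longest(pattern, "aabb")
print(longest.span(), repr(longest.group(1)))  # (0, 4) 'ab'
```

With the longest match, a* gives back one a so that (ab)* can consume ab, which is why \1 becomes ab, as in the sed output.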

Incidentally, your commands are not fully equivalent: re.sub replaces every non-overlapping match by default, while sed's s/a/b/ replaces only the first match on each line. The sed version should have been:

echo "aabb" | sed "s/a*\(\(ab\)*\)b/\1/g"

This explains why we get the empty string in Python: the RE matches aab the first time and the remaining b the second time, removing each part (since each is matched entirely by a* and the final b of the regexp, leaving \1 empty). You can see this by the following variant:

>>> re.sub("a*((ab)*)b", r"X\1Y", "aabb")
'XYXY'
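The same point can be checked by listing the individual matches, and by limiting re.sub to a single replacement so that it lines up with sed's s command without the g flag. This is a small sketch using only re.finditer and the count argument of re.sub; the variable names are just for illustration.

```python
import re

pattern = r"a*((ab)*)b"

# Each match's span and what group 1 captured.
spans = [(m.span(), m.group(1)) for m in re.finditer(pattern, "aabb")]
print(spans)       # [((0, 3), ''), ((3, 4), '')]

# One replacement only, like sed's s/.../.../ without g.
once = re.sub(pattern, r"\1", "aabb", count=1)
print(repr(once))  # 'b'
```

Even with a single replacement Python yields b rather than sed's ab, so the g flag alone does not account for the difference; the first match itself is shorter.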
