Python re.sub with non-greedy mode (.*?) and end of string ($) behaves greedily!
Problem description
Code:
import re
s = '<br><br />A<br />B'  # renamed from str to avoid shadowing the built-in
print(re.sub(r'<br.*?>\w$', '', s))
It is expected to return <br><br />A, but it returns an empty string ''!
Any suggestions?
Recommended answer
Laziness only works from left to right, not the other way around: a lazy quantifier affects where a match ends, not where it starts. It basically means "don't match more unless the rest of the pattern failed to match". Here's what's going on:
- The regex engine matches <br at the start of the string. .*? is skipped for now, because it is lazy.
- It tries to match >, and succeeds.
- It tries to match \w and fails. Now it gets interesting: the engine backtracks and sees the .*? rule. In this case, . can match the first >, so there is still hope for a match.
- This keeps happening until the regex reaches the slash. Then >\w can match, but $ fails. Again, the engine goes back to the lazy .*? rule and keeps extending it, until the pattern matches the whole string <br><br />A<br />B.
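The walkthrough above can be checked directly. The following is a minimal sketch (the variable names s and lazy_match are illustrative, not from the question) that uses re.search to show which span the lazy pattern ends up matching:

```python
import re

s = '<br><br />A<br />B'

# The engine keeps the leftmost start (index 0) and extends .*? one
# character at a time until the rest of the pattern, including $, fits.
lazy_match = re.search(r'<br.*?>\w$', s)

print(lazy_match.span())        # start and end offsets of the match
print(lazy_match.group())       # the matched text
print(lazy_match.group() == s)  # True: the match covers the entire string
```

Because the match starts at the first <br and is only allowed to end at the end of the string, re.sub has nothing left over once it replaces it.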
Luckily, there's an easy solution: replace the pattern with <br[^>]*>\w$
That way the match cannot extend outside a single tag, so only the last occurrence is replaced.
Strictly speaking, this doesn't work well for HTML in general, because tag attribute values can contain > characters, but I assume it's just an example.
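As a quick check of the suggested pattern, here is a small sketch. The second input, tricky, is a made-up string (not from the question) that illustrates the attribute caveat:

```python
import re

s = '<br><br />A<br />B'

# [^>]* cannot cross a '>', so each match attempt stays inside one tag.
# Attempts starting at the first two <br tags fail, and only the last one
# satisfies >\w$.
fixed = re.sub(r'<br[^>]*>\w$', '', s)
print(fixed)  # <br><br />A

# Caveat: a '>' inside an attribute value ends [^>]* early, so the tag
# is not recognized and the string is returned unchanged.
tricky = '<br title="a>b">B'  # hypothetical input for illustration
unchanged = re.sub(r'<br[^>]*>\w$', '', tricky)
print(unchanged == tricky)  # True
```

For real HTML, a parser is usually a safer choice than a regular expression; the pattern here is only meant for simple strings like the one in the question.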