Python re.sub with non-greedy mode (.*?) and end of string ($) behaves greedily!

Question

Code:

import re
s = '<br><br />A<br />B'  # renamed from str to avoid shadowing the built-in
print(re.sub(r'<br.*?>\w$', '', s))

I expected this to return <br><br />A, but it returns an empty string ''!

Any suggestions?

Recommended answer

Laziness (non-greediness) works from left to right, but not the other way around: it shortens where a match ends, not where it starts. A lazy quantifier basically means "don't match unless the rest of the pattern failed to match". Here's what's going on:

  1. The regex engine matches <br at the start of the string.
  2. .*? is skipped for now, because it is lazy.
  3. The engine tries to match > and succeeds.
  4. It tries to match \w and fails, because the next character is <. Now it gets interesting: the engine backtracks and sees the .*? rule. Here, . can match the first >, so there's still hope for a match.
  5. This keeps happening until the regex reaches the first slash. Then >\w can match (the > followed by A), but $ fails. Again, the engine goes back to the lazy .*? rule and keeps extending it, until the pattern matches the entire string: <br><br />A<br />B
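
The trace above can be checked with a short sketch. It uses re.search instead of re.sub so the matched span is visible; the variable names (s, m, lazy) are illustrative and not part of the original post.

```python
import re

s = '<br><br />A<br />B'

# The lazy pattern from the question.
m = re.search(r'<br.*?>\w$', s)

# The engine tries start positions from left to right, so the match begins
# at the first '<br' (index 0). '.*?' then grows one character at a time
# until '>\w$' can finally succeed at the end of the string.
print(m.span())   # (0, 18)
print(m.group())  # <br><br />A<br />B

# Without the '$' anchor, the same lazy quantifier stops as early as it can.
lazy = re.search(r'<br.*?>', s)
print(lazy.group())  # <br>
```

Since the match covers all 18 characters, re.sub replaces the whole string and returns ''.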

Luckily, there's an easy solution: by using the pattern <br[^>]*>\w$ instead, you don't allow the match to extend outside of a single tag, so it should replace only the last occurrence.
Strictly speaking, this doesn't work well for HTML, because tag attributes can contain > characters, but I assume it's just an example.
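
A minimal sketch of the suggested fix, using the same input string as the question (the variable names s and fixed are illustrative):

```python
import re

s = '<br><br />A<br />B'

# [^>]* cannot consume a '>', so each attempt stays inside a single tag.
# Attempts starting at the first two tags fail, and only the last
# '<br />B' satisfies '>\w$'.
fixed = re.sub(r'<br[^>]*>\w$', '', s)
print(fixed)  # <br><br />A
```

As noted above, this is only reasonable for simple, well-behaved input; it is not a general way to process HTML.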
