Why Does a Repeated Capture Group Return these Strings?


Question

Can someone explain why the following returns 'cc'?

>>> import re
>>> re.match(r'(..)+', 'aabbcc').group(1)
'cc'

I was told that it puts each match into group(1), so the last match, 'cc', is what remains. Is that true?

If so, how do you explain the following?

>>> re.match(r'(..)+(...)', 'aabbcc').group(1)
'aa'

Solution

Repeated Capture Group: The Group Number Stays the Same

The group defined by (..) is Group 1. The + quantifier repeats it. Every time the engine is able to repeat the group (matching two characters), Group 1 gets overwritten.

  • When the engine starts to match, it captures aa to Group 1
  • It then captures bb to Group 1
  • It then captures cc to Group 1.

When you inspect Group 1, the engine returns cc. All other captures are lost.

(The exception is the .NET engine, which also returns cc but also allows you to inspect intermediate captures thanks to the CaptureCollection object. It would contain aa, bb and cc.)
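The overwriting described above can be checked directly in Python. This is a minimal sketch using only the standard `re` module; the variable names are illustrative and not part of the original answer. It also shows one possible workaround (an assumption about what the reader may want, not something the answer prescribes): matching the two-character unit on its own with `re.findall` to keep every repetition.

```python
import re

text = 'aabbcc'

# The whole match covers all six characters...
m = re.match(r'(..)+', text)
whole = m.group(0)

# ...but Group 1 holds only the last repetition, and its span
# points at the final two characters of the string.
last = m.group(1)
last_span = m.span(1)

# Workaround sketch: match the repeated unit by itself so that
# each repetition is returned as a separate item.
pairs = re.findall(r'..', text)

print(whole)      # aabbcc
print(last)       # cc
print(last_span)  # (4, 6)
print(pairs)      # ['aa', 'bb', 'cc']
```

Python's standard `re` module has no counterpart to .NET's CaptureCollection, so the intermediate captures have to be collected separately, as in the `re.findall` call.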

With (..)+(...), Why does Group 1 Contain aa? Backtracking!

To understand this, we again need to follow the path of the regex engine.

  • Once again, when the engine starts to match, it captures aa to Group 1
  • Again, it repeats the (..) group and captures bb to Group 1
  • Again, it repeats the (..) group and captures cc to Group 1
  • The engine now tries to match (...). It fails: there are no characters left to consume.
  • The engine backtracks both in the string and in the regex pattern. The + means one or more times, and we matched .. three times, so we can give one up, or even two. At this stage, the engine gives up the last match of the quantified (..)+ group, which is cc. We are back to when Group 1 was bb.
  • The engine tries to match (...) again. There are only two characters left: cc, so it fails again.
  • The engine backtracks by giving up the last match of the quantified (..)+ group, which is bb. At this stage, Group 1 is aa again.
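  • The engine tries to match (...) again. It succeeds: Group 2 is bbc, and Group 1 is aa. The trailing c is simply left unmatched, since the pattern is not anchored at the end of the string.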
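The steps above can be sketched in Python. The first part only inspects what `re.match` reports. The second part is a hand-written approximation of the backtracking loop for this one pattern: `trace_backtracking` is a hypothetical helper written for illustration, not how the regex engine is actually implemented, and it assumes the input contains no newlines (which `.` would not match by default).

```python
import re

text = 'aabbcc'

# What the real engine reports after backtracking.
m = re.match(r'(..)+(...)', text)
groups = m.groups()
spans = (m.span(1), m.span(2))
matched = m.group(0)

def trace_backtracking(s):
    """Mimic (..)+(...) by trying the greedy repeat count first,
    then giving up one repetition at a time."""
    max_reps = len(s) // 2                 # greedy: take as many pairs as fit
    for reps in range(max_reps, 0, -1):    # + requires at least one repetition
        end = reps * 2
        group1 = s[end - 2:end]            # Group 1 is the last pair kept
        group2 = s[end:end + 3]            # (...) needs three characters
        if len(group2) == 3:
            return reps, group1, group2
        print(f'{reps} repetition(s): Group 1 = {group1!r}, '
              f'only {len(group2)} character(s) left, backtrack')
    return None

result = trace_backtracking(text)

print(groups)   # ('aa', 'bbc')
print(spans)    # ((0, 2), (2, 5))
print(matched)  # aabbc
print(result)   # (1, 'aa', 'bbc')
```

The spans show that Group 1 ends at index 2 and Group 2 covers indices 2 to 5, so the overall match is aabbc and the final c is not part of it.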
