Lookahead in the middle of regex doesn't match

Problem description

I have a string $s1 = "a_b"; and I want to match this string but only capture the letters. I tried to use a lookahead:

if($s1 =~ /([a-z])(?=_)([a-z])/){print "Captured: $1, $2\n";}

but this does not seem to match my string. I have solved the original problem by using (?:_) instead, but I am curious as to why my original attempt did not work. To my understanding, a lookahead matches but does not capture, so what did I do wrong?

Solution

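A lookahead checks the characters immediately after the current position, and if the assertion is true the engine backtracks to where the previous match ended (right after a) and continues matching from there. In your pattern the second ([a-z]) is therefore tried against _ and fails. Your regex would work only if you put a _ right after the positive lookahead: ([a-z])(?=_)_([a-z])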

You don't even need a (non-)capturing group as a replacement:

if ($s1 =~ /([a-z])_([a-z])/) { print "Captured: $1, $2\n"; }
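
The three patterns discussed so far can be compared side by side. The following is a minimal sketch, not part of the original answer; the labels and the %result hash are names made up for illustration.

```perl
use strict;
use warnings;

my $s1 = "a_b";

# Patterns taken from the question and the answer; the labels are illustrative.
my @patterns = (
    [ 'lookahead only',         qr/([a-z])(?=_)([a-z])/  ],
    [ 'lookahead + underscore', qr/([a-z])(?=_)_([a-z])/ ],
    [ 'plain underscore',       qr/([a-z])_([a-z])/      ],
);

my %result;
for my $p (@patterns) {
    my ($label, $re) = @$p;
    # In list context a successful match returns the captured groups.
    if (my @caps = $s1 =~ $re) {
        $result{$label} = "@caps";
        print "$label: captured $caps[0], $caps[1]\n";
    }
    else {
        $result{$label} = undef;
        print "$label: no match\n";
    }
}
```

Only the first pattern should report no match, because nothing in it consumes the underscore.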

Edit

In reply to @Borodin's comment

I think that moving backwards is the same as a backtrack, which is easier to recognize by debugging the whole thing (Perl debug mode):

Matching REx "a(?=_)_b" against "a_b"
.
.
.
   0 <> <a_b>                |   0| 1:EXACT <a>(3)
   1 <a> <_b>                |   0| 3:IFMATCH[0](9)
   1 <a> <_b>                |   1|  5:EXACT <_>(7)
   2 <a_> <b>                |   1|  7:SUCCEED(0)
                             |   1|  subpattern success...
   1 <a> <_b>                |   0| 9:EXACT <_b>(11)
   3 <a_b> <>                |   0| 11:END(0)
Match successful!
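
A trace like the one above can be produced with Perl's re pragma. This is a minimal sketch; the exact layout of the trace, and how much the optimizer prints before it, is assumed to differ between Perl versions. The $matched variable is only there for illustration.

```perl
use strict;
use warnings;

my $s1 = "a_b";
my $matched;

{
    # Lexically scoped: only regexes compiled inside this block are traced.
    # The compilation and execution traces are written to STDERR.
    use re 'debug';
    $matched = ($s1 =~ /a(?=_)_b/) ? 1 : 0;
}

print "matched: $matched\n";
```

The same trace should also be available from the command line via perl -Mre=debug.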

As the debug output above shows, at the fourth line of the results (the third step) the engine consumes the characters a_ while inside the lookahead assertion. After the positive lookahead succeeds we see a backtrack: the engine skips back over the whole sub-pattern and resumes at the position right after a.

At line #5, the engine has consumed only one character: a.

[Image: Regex101 debugger]

How I interpret this backtrack is clearer in this illustration (thanks to @JDB, whose style of representation I borrowed):

a(?=_)_b
*
|\
| \
|  : a (match)
|  * (?=_)
|  |↖
|  | ↖
|  |↘ ↖
|  | ↘ ↖
|  |  ↘ ↖
|  |   : _ (match)
|  |     ^ SUBPATTERN SUCCESS (OP_ASSERT :=> MATCH_MATCH)
|  * _b
|  |\
|  | \
|  |  : _ (match)
|  |  : b (match)
|  | /
|  |/
| /
|/
MATCHED

By this I mean that if the lookahead assertion succeeds, then, since part of the input string has already been extracted, the engine goes back up to the previous match offset (eptr, the pointer into the subject, is not changed, but the offset is). It resets the consumed characters and tries to continue matching from there, and I call that a backtrack. Below is a visual representation of the steps taken by the engine, made with Regexp::Debugger:

[Image: Regexp::Debugger output]

So I see it as a backtrack, or a kind of one. If I'm wrong about any of this, I'd welcome corrections with open arms.
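
Whatever name is given to the step back, the offsets involved can be checked without a debugger through Perl's @- and @+ arrays. This is a minimal sketch under one assumption: a capture group is added inside the lookahead purely for illustration, so that the span the assertion examined stays visible after the match. The variable names are made up.

```perl
use strict;
use warnings;

my $s = "a_b";

# Group 1 sits inside the lookahead, so its offsets show what the
# assertion looked at, while $+[0] shows where the overall match ended.
my ($match_end, $look_start, $look_end);
if ($s =~ /a(?=(_))/) {
    ($match_end, $look_start, $look_end) = ($+[0], $-[1], $+[1]);
    print "overall match ends at offset $match_end\n";
    print "lookahead examined offsets $look_start to $look_end\n";
}

# For comparison: when the underscore is matched outside a lookahead,
# it is consumed and the match end moves past it.
my $consumed_end;
if ($s =~ /a_/) {
    $consumed_end = $+[0];
    print "with a consumed underscore the match ends at offset $consumed_end\n";
}
```

The first match ends at offset 1 even though the lookahead examined the character between offsets 1 and 2, which is why the next token in the pattern is tried against _ again.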
