Lookahead in the middle of regex doesn't match
Problem description
I have a string $s1 = "a_b";
and I want to match this string but only capture the letters. I tried to use a lookahead:
if($s1 =~ /([a-z])(?=_)([a-z])/){print "Captured: $1, $2\n";}
but this does not seem to match my string. I have solved the original problem by using (?:_) instead, but I am curious as to why my original attempt did not work. To my understanding a lookahead matches but does not capture, so what did I do wrong?
Answer
A lookahead inspects the characters immediately after the current position without consuming them. If the assertion succeeds, the engine goes back to where it was before the lookahead (right after a) and continues matching from there, so the next ([a-z]) is tried against _ and fails. Your regex works only if you put a literal _ right after the positive lookahead: ([a-z])(?=_)_([a-z])
You don't even need a (non-)capturing group for the underscore:
if ($s1 =~ /([a-z])_([a-z])/) { print "Captured: $1, $2\n"; }
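The three variants discussed so far can be compared side by side. This is only a minimal sketch: the labels and the @attempts and %result names are made up for illustration, and the patterns are the ones quoted in the question and answer.

```perl
use strict;
use warnings;

my $s1 = "a_b";

# Each entry pairs an illustrative label with one of the patterns discussed.
my @attempts = (
    [ 'lookahead only',          qr/([a-z])(?=_)([a-z])/  ],
    [ 'lookahead, then literal', qr/([a-z])(?=_)_([a-z])/ ],
    [ 'plain literal',           qr/([a-z])_([a-z])/      ],
);

my %result;
for my $attempt (@attempts) {
    my ($label, $re) = @$attempt;
    if ($s1 =~ $re) {
        # Store the two captured letters for this pattern.
        $result{$label} = "$1,$2";
        print "$label: captured $1, $2\n";
    }
    else {
        $result{$label} = undef;
        print "$label: no match\n";
    }
}
```

Only the first pattern should fail, because nothing in it ever consumes the underscore.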
Edit
In reply to @Borodin's comment
I think that moving backwards is the same as a backtrack, which is easier to see by debugging the whole thing (Perl debug mode):
Matching REx "a(?=_)_b" against "a_b"
.
.
.
0 <> <a_b> | 0| 1:EXACT <a>(3)
1 <a> <_b> | 0| 3:IFMATCH[0](9)
1 <a> <_b> | 1| 5:EXACT <_>(7)
2 <a_> <b> | 1| 7:SUCCEED(0)
| 1| subpattern success...
1 <a> <_b> | 0| 9:EXACT <_b>(11)
3 <a_b> <> | 0| 11:END(0)
Match successful!
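A trace like the one above can probably be reproduced with the re pragma's debug mode. The sketch below assumes a reasonably recent Perl; the trace is written to STDERR, and its exact format and the optimizer's preliminary lines differ between Perl versions, so it may not match the listing line for line. The $matched variable is just an illustrative name.

```perl
use strict;
use warnings;

my $s1 = "a_b";
my $matched;
{
    # The pragma is lexically scoped, so only this block is traced.
    use re 'debug';
    $matched = $s1 =~ /a(?=_)_b/ ? 1 : 0;
}
print "matched: $matched\n";
```

The same thing should work as a one-liner: perl -Mre=debug -e '"a_b" =~ /a(?=_)_b/'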
As the debug output above shows, at the fourth line of the trace (the third step) the engine has consumed the characters a_ while inside the lookahead assertion. Once the positive lookahead succeeds, a backtrack happens: the engine steps back over the whole sub-pattern and resumes at the position right after a.
At line #5, the engine has consumed only one character: a.
[Figure: Regex101 debugger output]
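The same position reset can be observed without a debugger by looking at where a match ends. This sketch uses Perl's built-in @+ array ($+[0] is the offset just past the end of the last successful match); the variable names are illustrative only.

```perl
use strict;
use warnings;

my $s = "a_b";

# With the lookahead, "_" is only peeked at, so the match ends after "a".
my $end_with_lookahead = $s =~ /a(?=_)/ ? $+[0] : -1;

# With a literal, "_" is consumed, so the match ends one character later.
my $end_with_literal = $s =~ /a_/ ? $+[0] : -1;

print "/a(?=_)/ ends at offset $end_with_lookahead\n";
print "/a_/     ends at offset $end_with_literal\n";
```

The difference of one between the two offsets is the underscore that the lookahead inspected but left unconsumed.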
My interpretation of this backtrack is clearer in the following illustration (thanks to @JDB, whose style of representation I borrowed):
a(?=_)_b
*
|\
| \
| : a (match)
| * (?=_)
| |↖
| | ↖
| |↘ ↖
| | ↘ ↖
| | ↘ ↖
| | : _ (match)
| | ^ SUBPATTERN SUCCESS (OP_ASSERT :=> MATCH_MATCH)
| * _b
| |\
| | \
| | : _ (match)
| | : b (match)
| | /
| |/
| /
|/
MATCHED
By this I mean that if the lookahead assertion succeeds, then, since part of the input string has already been extracted, the engine goes back up to the previous match offset (eptr, the pointer into the subject, is not changed, but the offset is). It resets the consumed characters and tries to continue matching from there, and this is what I call a backtrack. The steps taken by the engine can also be visualized with Regexp::Debugger.
[Figure: Regexp::Debugger visualization of the matching steps]
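The claim that the lookahead really does match part of the input before the position is reset can also be checked from plain Perl. As a rough sketch, the pattern below adds two diagnostic capture groups that are not in the original regex: one inside the lookahead and one around the literal underscore that follows it. The built-in @- array gives the start offset of each group.

```perl
use strict;
use warnings;

my $s1 = "a_b";
my ($peeked, $consumed, $peek_start, $consumed_start, $match_end);

# Group 2 sits inside the lookahead; group 3 is the literal "_" after it.
if ($s1 =~ /([a-z])(?=(_))(_)([a-z])/) {
    ($peeked, $consumed)           = ($2, $3);
    ($peek_start, $consumed_start) = ($-[2], $-[3]);
    $match_end                     = $+[0];
    print "lookahead saw '$peeked' at offset $peek_start\n";
    print "literal consumed '$consumed' at offset $consumed_start\n";
    print "whole match ends at offset $match_end\n";
}
```

Both groups should report the same character at the same offset, which is consistent with the engine returning to the position right after a once the lookahead has succeeded.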
So I see it as a backtrack, or at least a kind of one. If I'm wrong about any of this, I'd welcome corrections with open arms.