Surprising, but correct behavior of a greedy subexpression in a positive lookbehind assertion


Problem description

Note:

  • The observed behavior is correct, but may be surprising at first; it was to me, and I suspect it may be to others as well - though probably not to those intimately familiar with regex engines.

  • The repeatedly suggested duplicate, "Regex lookahead, lookbehind and atomic groups", contains general information about look-around assertions, but does not address the specific misconception at hand, as discussed in more detail in the comments below.

Using a greedy, and therefore by definition variable-width, subexpression inside a positive lookbehind assertion can result in surprising behavior.

The examples use PowerShell for convenience, but the behavior applies to the .NET regex engine in general:

This command works as I intuitively expect:

# OK:  
#     The subexpression matches greedily from the start up to and
#     including the last "_", and, by including the matched string ($&) 
#     in the replacement string, effectively inserts "|" there - and only there.
PS> 'a_b_c' -replace '^.+_', '$&|'
a_b_|c

The following command, which uses a positive lookbehind assertion, (?<=...), is seemingly equivalent - but isn't:

# CORRECT, but SURPRISING:
#   Use a positive lookbehind assertion to *seemingly* match
#   only up to and including the last "_", and insert a "|" there.
PS> 'a_b_c' -replace '(?<=^.+_)', '|'
a_|b_|c  # !! *multiple* insertions were performed

Why isn't it equivalent? Why were multiple insertions performed?

Recommended answer

tl;dr:

  • Inside a lookbehind assertion, a greedy subexpression in effect also behaves non-greedily when matching globally (in addition to behaving greedily), because every prefix of the input string is considered.

My problem was that I hadn't considered that, with a lookbehind assertion, each and every character position in the input string must be checked to see whether the text preceding that position matches the subexpression inside the assertion.

This, combined with the always-global replacement that PowerShell's -replace operator performs (that is, all matches are replaced), resulted in multiple insertions:

That is, the greedy, anchored subexpression ^.+_ legitimately matched twice when applied to the text to the left of the character position currently being considered:

  • First, when a_ was the text to the left.
  • And again when a_b_ was the text to the left.

As a result, | was inserted twice.
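The prefix-by-prefix reasoning above can be checked with a short sketch. It is only an illustration: the variable names ($text, $manual, $engine) are made up for this example, and the \z anchor is added so that the manual test is pinned to the end of each prefix, which approximates what the lookbehind checks at each position.

```powershell
$text = 'a_b_c'

# Manually test every prefix of the input against the subexpression.
# \z pins the match to the end of the prefix, i.e. to the text
# immediately to the left of the position being considered.
$manual = foreach ($i in 0..$text.Length) {
    if ($text.Substring(0, $i) -match '^.+_\z') { $i }
}

# Ask the regex engine where the lookbehind itself succeeds.
$engine = [regex]::Matches($text, '(?<=^.+_)') | ForEach-Object { $_.Index }

"manual: $manual"   # manual: 2 4
"engine: $engine"   # engine: 2 4
```

Both approaches report positions 2 and 4, that is, the positions right after a_ and right after a_b_, which is where the two | characters end up.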

By contrast, without a lookbehind assertion, the greedy expression ^.+_ by definition matches only once, through to the last _, because it is applied only to the input string as a whole.
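The contrast can be made visible by counting matches. The sketch below also shows two ways to end up with a single insertion; neither comes from the original answer. The count-limited Replace call merely illustrates that the global replacement is what multiplies the insertions, and the extra negative lookahead (?!.*_) is an assumption about what one might want, namely to insert only after the last _.

```powershell
$text = 'a_b_c'

# Without a lookbehind: one greedy match that consumes 'a_b_'.
$plainCount = [regex]::Matches($text, '^.+_').Count            # 1

# With the lookbehind: two zero-width matches (after 'a_' and after 'a_b_').
$lookbehindCount = [regex]::Matches($text, '(?<=^.+_)').Count  # 2

# Replacing only the first match shows that the lookbehind succeeds
# at the earliest possible position, not just at the last "_".
$firstOnly = ([regex] '(?<=^.+_)').Replace($text, '|', 1)      # a_|b_c

# One possible way to insert only after the last "_":
# additionally require that no "_" follows the current position.
$lastOnly = $text -replace '(?<=^.+_)(?!.*_)', '|'             # a_b_|c
```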
