How are nested capturing groups numbered in regular expressions?


Question

Is there a defined behavior for how regular expressions should handle the capturing behavior of nested parentheses? More specifically, can you reasonably expect that different engines will capture the outer parentheses in the first position, and nested parentheses in subsequent positions?

Consider the following PHP code (using PCRE regular expressions):

<?php
  $test_string = 'I want to test sub patterns';
  preg_match('{(I (want) (to) test) sub (patterns)}', $test_string, $matches);
  print_r($matches);
?>

Array
(
    [0] => I want to test sub patterns  //entire pattern
    [1] => I want to test           //entire outer parenthesis
    [2] => want             //first inner
    [3] => to               //second inner
    [4] => patterns             //next parentheses set
)

The entire parenthesized expression is captured first (I want to test), and then the inner parenthesized patterns are captured next ("want" and "to"). This makes logical sense, but I could see an equally logical case being made for first capturing the sub parentheses, and THEN capturing the entire pattern.

So, is this "capture the entire thing first" defined behavior in regular expression engines, or is it going to depend on the context of the pattern and/or the behavior of the engine (PCRE being different than C#'s being different than Java's being different than etc.)?

Answer

From perlrequick

If the groupings in a regex are nested, $1 gets the group with the leftmost opening parenthesis, $2 the next opening parenthesis, etc.
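As a quick cross-engine check, the rule quoted above can be tried in an engine other than Perl or PCRE. The sketch below uses Python's standard `re` module (my choice of engine, not something the original thread tested) with the same pattern and subject string as the PHP example in the question. The variable names are illustrative only.

```python
import re

# Same pattern and subject as the PHP example in the question.
pattern = r"(I (want) (to) test) sub (patterns)"
subject = "I want to test sub patterns"

match = re.search(pattern, subject)
assert match is not None

# Group 0 is the whole match; groups 1..N follow the order in which
# the opening parentheses appear in the pattern, left to right.
numbered = [match.group(i) for i in range(match.re.groups + 1)]

for index, text in enumerate(numbered):
    print(index, repr(text))
```

Under this rule the outer group gets number 1 because its opening parenthesis comes first, and the inner groups follow, which matches the PHP output shown in the question.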

Caveat: opening parentheses of non-capturing constructs, such as the non-capturing group (?:...) or the lookahead (?=...), are not counted.
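A minimal sketch of this caveat, again using Python's `re` module as an assumed stand-in engine: the pattern below wraps two capturing groups in a non-capturing group and adds a lookahead, and only the two plain parentheses receive numbers. The pattern is made up for illustration and does not come from the Perl or PCRE documentation.

```python
import re

subject = "the red king"

# (?:...) groups without capturing and (?=...) is a lookahead;
# neither of their opening parentheses is counted.
pattern = r"the (?:(red|white) (?=king|queen)(king|queen))"

match = re.search(pattern, subject)
assert match is not None

group_count = match.re.groups   # number of capturing groups in the pattern
captures = match.groups()       # captured text for groups 1..group_count

print(group_count, captures)
```

Only `(red|white)` and `(king|queen)` are numbered here, as groups 1 and 2.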

Update

I don't use PCRE much, as I generally use the real thing ;), but PCRE's docs show the same as Perl's:

SUBPATTERNS

2. It sets up the subpattern as a capturing subpattern. This means that, when the whole pattern matches, that portion of the subject string that matched the subpattern is passed back to the caller via the ovector argument of pcre_exec(). Opening parentheses are counted from left to right (starting from 1) to obtain numbers for the capturing subpatterns.

For example, if the string "the red king" is matched against the pattern

the ((red|white) (king|queen))

the captured substrings are "red king", "red", and "king", and are numbered 1, 2, and 3, respectively.
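The documentation's example can also be inspected by looking at where each group matched in the subject. This is a sketch using Python's `re` module rather than `pcre_exec()`, so the spans below are what Python reports, not the contents of PCRE's ovector.

```python
import re

subject = "the red king"
pattern = r"the ((red|white) (king|queen))"

match = re.search(pattern, subject)
assert match is not None

# (start, end) offsets in the subject for groups 1, 2 and 3.
spans = [match.span(i) for i in range(1, match.re.groups + 1)]

for number, (start, end) in enumerate(spans, start=1):
    print(number, (start, end), repr(subject[start:end]))
```

Groups 1 and 2 start at the same offset in the subject, so the numbering cannot come from match positions alone. It follows the position of each opening parenthesis in the pattern.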

If PCRE is drifting away from Perl regex compatibility, perhaps the acronym should be redefined--"Perl Cognate Regular Expressions", "Perl Comparable Regular Expressions" or something. Or just divest the letters of meaning.
