递归PHP正则表达式 [英] Recursive PHP Regex

查看:84
本文介绍了递归PHP正则表达式的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我选择了ridgerunner的答案,因为它包含解决问题所需的信息.但是我也想为特定问题添加一个完善的解决方案,以防其他人也想完全理解该示例.您会在下面找到它.

这个问题是关于为递归表达式阐明php的regex引擎的行为. (如果您想知道如何在不使用递归php regex的情况下正确匹配下面的字符串,那很酷,但这不是问题.)

This question is about clarifying the behavior of php's regex engine for recursive expressions. (If you ideas for how to properly match the strings below without using recursive php regex, that's very cool, but that's not the question.)

a(?:(?R)|a?)a

这是一个简单的表达式,旨在匹配字符"a"或什么都不匹配,嵌套在字符"a"的一个或多个嵌套中.例如,aa,aaa,aaaa,aaaaa.您无需为此使用递归:

This is a simple expression that aims to match the character "a" or nothing, nested in one or multiple nests of the character "a". For instance, aa, aaa, aaaa, aaaaa. You don't need to use recursion for this:

aa*a

会很好.但是关键是要使用递归.

would work great. But the point is to use recursion.

这是一段代码,您可以运行它来测试我的失败模式:

Here is a piece of code you can run to test my failing pattern:

<?php
$tries=array('a','aa','aaa','aaaa','aaaaa','aaaaaa');
$regex='#a(?:(?R)|a?)a#';
foreach ($tries as $try) {
echo $try." : ";
if (preg_match($regex,$try,$hit)) echo $hit[0]."<br />";
else echo 'no match<br />';
}
?>

在模式中,两个"a"构成交替.在交替中,我们要么匹配整个模式的递归(两个"a"构成交替),要么匹配字符"a"(可选为空).

In the pattern, two "a"s are framing an alternation. In the alternation, we either match a recursion of the whole pattern (two "a"s framing an alternation), or the character "a", optionally empty.

在我看来,对于"aaaa",应与"aaaa"匹配.

In my mind, for "aaaa", this should match "aaaa".

但这是输出:

a : no match
aa : aa
aaa : aaa
aaaa : aaa
aaaaa : aaaaa
aaaaaa : aaa

有人可以解释输出的第三行和第五行发生了什么吗? 我试图追踪我想象中的引擎必须走的路,但是我必须想象它是错误的.为什么引擎返回"aaa"作为"aaaa"的匹配项?是什么让它如此渴望?我必须以错误的顺序想象匹配的树.

Can someone explain what is happening on the third and fifth lines of output? I have tried tracing the path that I imagine the engine must be taking, but I must be imagining it wrong. Why is the engine returning "aaa" as a match for "aaaa"? What makes it so eager? I must be imagining the matching tree in the wrong order.

我意识到

#(?:a|a(?R)a)*#

各种各样的作品,但是我的问题是为什么其他模式没有?

kind of works, but my question is why the other pattern doesn't.

感谢堆!

推荐答案

一个很好的(也是困难的)问题!

Excellent (and difficult) question!

首先,使用PCRE正则表达式引擎,(?R)的行为类似于原子组(与Perl不同).一旦匹配(或不匹配),则在递归调用中发生的匹配是最终的(并且递归调用中保存的所有回溯面包屑都将被丢弃).但是,正则表达式引擎确实保存了整个(?R)表达式所匹配的内容,并且可以将其返回并尝试其他替代方法以实现整体匹配.为了描述正在发生的事情,让我们稍微更改一下示例,以便更轻松地讨论并跟踪每个步骤中要匹配的内容.代替:aaaa作为主题文本,让我们使用:abcd.然后将正则表达式从'#a(?:(?R)|a?)a#'更改为:'#.(?:(?R)|.?).#'. regex引擎的匹配行为是相同的.

First, with the PCRE regex engine, the (?R) behaves like an atomic group (unlike Perl?). Once it matches (or doesn't match), the matching that happened inside the recursive call is final (and all backtracking breadcrumbs saved within the recursive call are discarded). However, the regex engine does save what was matched by the whole (?R) expression, and can give it back and try the other alternative to achieve an overall match. To describe what is happening, lets change your example slightly so that it will be easier to talk about and keep track of what is being matched at each step. Instead of: aaaa as the subject text, lets use: abcd. And lets change the regex from '#a(?:(?R)|a?)a#' to: '#.(?:(?R)|.?).#'. The regex engine matching behavior is the same.

answer = r'''
Step Depth Regex          Subject  Comment
1    0     .(?:(?R)|.?).  abcd     Dot matches "a". Advance pointers.
           ^              ^
2    0     .(?:(?R)|.?).  abcd     Try 1st alt. Recursive call (to depth 1).
                 ^         ^
3    1     .(?:(?R)|.?).  abcd     Dot matches "b". Advance pointers.
           ^               ^
4    1     .(?:(?R)|.?).  abcd     Try 1st alt. Recursive call (to depth 2).
                 ^          ^
5    2     .(?:(?R)|.?).  abcd     Dot matches "c". Advance pointers.
           ^                ^
6    2     .(?:(?R)|.?).  abcd     Try 1st alt. Recursive call (to depth 3).
                 ^           ^
7    3     .(?:(?R)|.?).  abcd     Dot matches "d". Advance pointers.
           ^                 ^
8    3     .(?:(?R)|.?).  abcd     Try 1st alt. Recursive call (to depth 4).
                 ^            ^
9    4     .(?:(?R)|.?).  abcd     Dot fails to match end of string.
           ^                  ^    DEPTH 4 (?R) FAILS. Return to step 8 depth 3.
                                   Give back text consumed by depth 4 (?R) = ""
10   3     .(?:(?R)|.?).  abcd     Try 2nd alt. Optional dot matches EOS.
                    ^         ^    Advance regex pointer.
11   3     .(?:(?R)|.?).  abcd     Required dot fails to match end of string.
                       ^      ^    DEPTH 3 (?R) FAILS. Return to step 6 depth 2
                                   Give back text consumed by depth3 (?R) = "d"
12   2     .(?:(?R)|.?).  abcd     Try 2nd alt. Optional dot matches "d".
                    ^        ^     Advance pointers.
13   2     .(?:(?R)|.?).  abcd     Required dot fails to match end of string.
                       ^      ^    Backtrack to step 12 depth 2
14   2     .(?:(?R)|.?).  abcd     Match zero "d" (give it back).
                    ^        ^     Advance regex pointer.
15   2     .(?:(?R)|.?).  abcd     Dot matches "d". Advance pointers.
                       ^     ^     DEPTH 2 (?R) SUCCEEDS.
                                   Return to step 4 depth 1
16   1     .(?:(?R)|.?).  abcd     Required dot fails to match end of string.
                       ^      ^    Backtrack to try other alternative. Give back
                                    text consumed by depth 2 (?R) = "cd"
17   1     .(?:(?R)|.?).  abcd     Optional dot matches "c". Advance pointers.
                    ^       ^      
18   1     .(?:(?R)|.?).  abcd     Required dot matches "d". Advance pointers.
                       ^     ^     DEPTH 1 (?R) SUCCEEDS.
                                   Return to step 2 depth 0
19   0     .(?:(?R)|.?).  abcd     Required dot fails to match end of string.
                       ^      ^    Backtrack to try other alternative. Give back
                                    text consumed by depth 1 (?R) = "bcd"
20   0     .(?:(?R)|.?).  abcd     Try 2nd alt. Optional dot matches "b".
                    ^      ^       Advance pointers.
21   0     .(?:(?R)|.?).  abcd     Dot matches "c". Advance pointers.
                       ^    ^      SUCCESSFUL MATCH of "abc"
'''

正则表达式引擎没有错.正确的匹配项是abc(对于原始问题,则为aaa.)对于所讨论的其他较长结果字符串,可以执行类似的步骤序列(尽管更长).

There is nothing wrong with the regex engine. The correct match is abc (or aaa for the original question.) A similar (albeit much longer) sequence of steps can be made for the other longer result string in question.

这篇关于递归PHP正则表达式的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆