递归 PHP 正则表达式 [英] Recursive PHP Regex

查看:32
本文介绍了递归 PHP 正则表达式的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我选择了 ridgerunner 的答案,因为它包含解决问题所需的信息.但我也想为特定问题添加一个完全充实的解决方案,以防其他人也想完全理解这个例子.你会在下面的某个地方找到它.

这个问题是关于澄清 php 的递归表达式的正则表达式引擎的行为.(如果您想知道如何在不使用递归 php 正则表达式的情况下正确匹配下面的字符串,那非常酷,但这不是问题.)

This question is about clarifying the behavior of php's regex engine for recursive expressions. (If you ideas for how to properly match the strings below without using recursive php regex, that's very cool, but that's not the question.)

a(?:(?R)|a?)a

这是一个简单的表达式,旨在匹配字符a"或什么都没有,嵌套在字符a"的一个或多个嵌套中.例如,aa、aaa、aaaa、aaaaa.您不需要为此使用递归:

This is a simple expression that aims to match the character "a" or nothing, nested in one or multiple nests of the character "a". For instance, aa, aaa, aaaa, aaaaa. You don't need to use recursion for this:

aa*a

会很好用.但重点是使用递归.

would work great. But the point is to use recursion.

这是一段代码,您可以运行它来测试我的失败模式:

Here is a piece of code you can run to test my failing pattern:

<?php
$tries=array('a','aa','aaa','aaaa','aaaaa','aaaaaa');
$regex='#a(?:(?R)|a?)a#';
foreach ($tries as $try) {
echo $try." : ";
if (preg_match($regex,$try,$hit)) echo $hit[0]."<br />";
else echo 'no match<br />';
}
?>

在模式中,两个a"构成一个交替.在交替中,我们要么匹配整个模式的递归(两个a"构成一个交替),要么匹配字符a",可选为空.

In the pattern, two "a"s are framing an alternation. In the alternation, we either match a recursion of the whole pattern (two "a"s framing an alternation), or the character "a", optionally empty.

在我看来,对于aaaa",这应该匹配aaaa".

In my mind, for "aaaa", this should match "aaaa".

但这里是输出:

a : no match
aa : aa
aaa : aaa
aaaa : aaa
aaaaa : aaaaa
aaaaaa : aaa

有人可以解释输出的第三行和第五行发生了什么吗?我曾尝试追踪我想象的引擎必须走的路径,但我一定是想象错了.为什么引擎返回aaa"作为aaaa"的匹配项?是什么让它如此渴望?我一定是在以错误的顺序想象匹配的树.

Can someone explain what is happening on the third and fifth lines of output? I have tried tracing the path that I imagine the engine must be taking, but I must be imagining it wrong. Why is the engine returning "aaa" as a match for "aaaa"? What makes it so eager? I must be imagining the matching tree in the wrong order.

我意识到

#(?:a|a(?R)a)*#

有点工作,但我的问题是为什么其他模式没有.

kind of works, but my question is why the other pattern doesn't.

多谢!

推荐答案

优秀(且困难)的问题!

Excellent (and difficult) question!

首先,使用 PCRE 正则表达式引擎,(?R) 的行为就像一个原子组(与 Perl 不同?).一旦匹配(或不匹配),递归调用中发生的匹配就是最终匹配(并且递归调用中保存的所有回溯面包屑都将被丢弃).但是,正则表达式引擎确实保存了与整个 (?R) 表达式匹配的内容,并且可以将其返回并尝试其他替代方法以实现整体匹配.为了描述正在发生的事情,让我们稍微更改您的示例,以便更容易讨论和跟踪每一步匹配的内容.而不是:aaaa 作为主题文本,让我们使用:abcd.让我们将正则表达式从 '#a(?:(?R)|a?)a#' 更改为:'#.(?:(?R)|.?).#'.正则引擎匹配行为相同.

First, with the PCRE regex engine, the (?R) behaves like an atomic group (unlike Perl?). Once it matches (or doesn't match), the matching that happened inside the recursive call is final (and all backtracking breadcrumbs saved within the recursive call are discarded). However, the regex engine does save what was matched by the whole (?R) expression, and can give it back and try the other alternative to achieve an overall match. To describe what is happening, lets change your example slightly so that it will be easier to talk about and keep track of what is being matched at each step. Instead of: aaaa as the subject text, lets use: abcd. And lets change the regex from '#a(?:(?R)|a?)a#' to: '#.(?:(?R)|.?).#'. The regex engine matching behavior is the same.

answer = r'''
Step Depth Regex          Subject  Comment
1    0     .(?:(?R)|.?).  abcd     Dot matches "a". Advance pointers.
           ^              ^
2    0     .(?:(?R)|.?).  abcd     Try 1st alt. Recursive call (to depth 1).
                 ^         ^
3    1     .(?:(?R)|.?).  abcd     Dot matches "b". Advance pointers.
           ^               ^
4    1     .(?:(?R)|.?).  abcd     Try 1st alt. Recursive call (to depth 2).
                 ^          ^
5    2     .(?:(?R)|.?).  abcd     Dot matches "c". Advance pointers.
           ^                ^
6    2     .(?:(?R)|.?).  abcd     Try 1st alt. Recursive call (to depth 3).
                 ^           ^
7    3     .(?:(?R)|.?).  abcd     Dot matches "d". Advance pointers.
           ^                 ^
8    3     .(?:(?R)|.?).  abcd     Try 1st alt. Recursive call (to depth 4).
                 ^            ^
9    4     .(?:(?R)|.?).  abcd     Dot fails to match end of string.
           ^                  ^    DEPTH 4 (?R) FAILS. Return to step 8 depth 3.
                                   Give back text consumed by depth 4 (?R) = ""
10   3     .(?:(?R)|.?).  abcd     Try 2nd alt. Optional dot matches EOS.
                    ^         ^    Advance regex pointer.
11   3     .(?:(?R)|.?).  abcd     Required dot fails to match end of string.
                       ^      ^    DEPTH 3 (?R) FAILS. Return to step 6 depth 2
                                   Give back text consumed by depth3 (?R) = "d"
12   2     .(?:(?R)|.?).  abcd     Try 2nd alt. Optional dot matches "d".
                    ^        ^     Advance pointers.
13   2     .(?:(?R)|.?).  abcd     Required dot fails to match end of string.
                       ^      ^    Backtrack to step 12 depth 2
14   2     .(?:(?R)|.?).  abcd     Match zero "d" (give it back).
                    ^        ^     Advance regex pointer.
15   2     .(?:(?R)|.?).  abcd     Dot matches "d". Advance pointers.
                       ^     ^     DEPTH 2 (?R) SUCCEEDS.
                                   Return to step 4 depth 1
16   1     .(?:(?R)|.?).  abcd     Required dot fails to match end of string.
                       ^      ^    Backtrack to try other alternative. Give back
                                    text consumed by depth 2 (?R) = "cd"
17   1     .(?:(?R)|.?).  abcd     Optional dot matches "c". Advance pointers.
                    ^       ^      
18   1     .(?:(?R)|.?).  abcd     Required dot matches "d". Advance pointers.
                       ^     ^     DEPTH 1 (?R) SUCCEEDS.
                                   Return to step 2 depth 0
19   0     .(?:(?R)|.?).  abcd     Required dot fails to match end of string.
                       ^      ^    Backtrack to try other alternative. Give back
                                    text consumed by depth 1 (?R) = "bcd"
20   0     .(?:(?R)|.?).  abcd     Try 2nd alt. Optional dot matches "b".
                    ^      ^       Advance pointers.
21   0     .(?:(?R)|.?).  abcd     Dot matches "c". Advance pointers.
                       ^    ^      SUCCESSFUL MATCH of "abc"
'''

正则表达式引擎没有任何问题.正确的匹配是 abc(或对于原始问题的 aaa.)可以为所讨论的另一个较长的结果字符串执行类似(尽管更长)的步骤序列.

There is nothing wrong with the regex engine. The correct match is abc (or aaa for the original question.) A similar (albeit much longer) sequence of steps can be made for the other longer result string in question.

这篇关于递归 PHP 正则表达式的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆