Matching nested patterns with regex (using PHP's recursion)

Question

I am currently trying to write a regular expression in PHP that matches a specific pattern which can contain itself, nested to an arbitrary depth. I know that regular expressions cannot do this by default, but PHP's recursive patterns (http://php.net/manual/de/regexp.reference.recursive.php) should make it possible.

I have a nested structure like this:

<a=5>
    <a=3>
        Foo
        <b>Bar</b>
    </a>
    Baz
</a>

Now I want to match the content of the outermost tag. In order to correctly pair the first opening tag with the last closing tag, I need PHP's recursion item (?R).

I tried a pattern like this:

/<a=5>((?R)|[^<]|<\/?[^a]|<\/?a[a-zA-Z0-9-])*<\/a>/s

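This basically means an <a=5>, followed by as many of the following as possible, and then a </a>:

  • another tag (recursion)
  • any character that does not open a tag
  • any tag opening, followed by an optional slash and then a character that is not an "a"
  • the same, but with an "a" that is not the whole tag name (at least 1 more character follows)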

The last 2 cases could be just one case [tag not named "a"], but I heard this should be avoided in regular expressions, because it needs lookarounds and would perform badly.

I see no mistake in my regex, yet it does not match the given string. I want the following match:

    <a=3>
        Foo
        <b>Bar</b>
    </a>
    Baz


Here is a link to the regex: https://www.regex101.com/r/lO1wA6/1

Recommended answer

You can use this regex to match what you want (the regex is placed in a string literal for the sake of convenience):

'~<a=5>(<([a-zA-Z0-9]+)[^>]*>(?1)*</\2>|[^<>]++)*</a>~'

Here is a breakdown of the regex above:

<a=5>
(
  <([a-zA-Z0-9]+)[^>]*>
  (?1)*
  </\2>
  |
  [^<>]++
)*
</a>

The first part <([a-zA-Z0-9]+)[^>]*>(?1)*</\2> matches a pair of matching tags and all of their content. It assumes that the name of the tag consists of the characters [a-zA-Z0-9]. The name of the tag is captured by ([a-zA-Z0-9]+) and backreferenced when matching the closing tag </\2>.

The second part [^<>]++ matches whatever else lies outside the tags. Note that quoted strings are not handled, so depending on your input it may not work.
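To make the quoted-string caveat concrete, here is a minimal sketch that runs the regex against two inputs. Both input strings are hypothetical and were made up for this illustration; they are not from the original question. They differ only in whether an attribute value contains a ">" character.

```php
<?php
// Pattern copied from the answer above.
$pattern = '~<a=5>(<([a-zA-Z0-9]+)[^>]*>(?1)*</\2>|[^<>]++)*</a>~';

// Hypothetical inputs (assumptions for illustration only).
$plain  = '<a=5><b title="xy">Bar</b></a>';   // no ">" inside the quotes
$quoted = '<a=5><b title="x>y">Bar</b></a>';  // ">" inside the quotes

// preg_match returns 1 on a match and 0 when nothing matches.
$plainResult  = preg_match($pattern, $plain);
$quotedResult = preg_match($pattern, $quoted);

var_dump($plainResult);
var_dump($quotedResult);
```

The first input should match. The second should not: [^>]* stops at the ">" inside the attribute value, so the regex treats the tag as ending there, and the stray ">" that follows is accepted by neither alternative of the group.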

Now back to the subroutine call (?1), which recursively calls the first capturing group. Notice that a tag can contain 0 or more instances of other tags or non-tag content. Because of the way the regex is written, this property is also shared by the outermost <a=5>...</a> pair.
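The following is a minimal sketch of how the regex might be used from PHP on the sample input from the question. The variable names are assumptions, and stripping the outer tags with substr is just one possible way to get at the inner content; the original answer does not show this step. The pattern from the question is run on the same input for comparison.

```php
<?php
// Pattern from the answer, and the pattern from the question for comparison.
$answerPattern   = '~<a=5>(<([a-zA-Z0-9]+)[^>]*>(?1)*</\2>|[^<>]++)*</a>~';
$questionPattern = '/<a=5>((?R)|[^<]|<\/?[^a]|<\/?a[a-zA-Z0-9-])*<\/a>/s';

// Sample input from the question.
$input = <<<'TXT'
<a=5>
    <a=3>
        Foo
        <b>Bar</b>
    </a>
    Baz
</a>
TXT;

$matched = preg_match($answerPattern, $input, $m);

// $m[0] holds the whole <a=5>...</a> element; cut off the outer tags
// to keep only what is between them.
$inner = null;
if ($matched === 1) {
    $inner = substr($m[0], strlen('<a=5>'), -strlen('</a>'));
}

// The question's pattern on the same input, for comparison.
$questionMatched = preg_match($questionPattern, $input);

echo $inner;
```

The question's pattern should find no match here. (?R) recurses into the whole pattern, which begins with the literal <a=5>, so the nested <a=3> can never be consumed by the recursion, and none of the other alternatives accept "<a=" either. The answer avoids this by recursing into group 1 with (?1), which accepts any tag name.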

Demo on regex101
