Matching nested patterns with regex (using PHP's recursion)
Question
I am currently trying to write a regular expression in PHP that matches a specific pattern which can contain itself, nested to an arbitrary depth. I know that regular expressions cannot do this by default, but PHP's recursive patterns (http://php.net/manual/de/regexp.reference.recursive.php) should make it possible.
I have a nested structure like this:
<a=5>
<a=3>
Foo
<b>Bar</b>
</a>
Baz
</a>
Now I want to match the content of the outermost tag. To correctly pair the first opening tag with the last closing tag, I need PHP's recursion item (?R).
I tried a pattern like this:
/<a=5>((?R)|[^<]|<\/?[^a]|<\/?a[a-zA-Z0-9-])*<\/a>/s
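This basically means <a=5>, followed by as many of the following as possible, followed by </a>:
- another tag (recursion)
- any character that does not open a tag
- any tag opening, followed by an optional slash, that is not followed by an "a"
- the same with an "a", but where the tag name does not end there (at least one more character)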
The last two cases could be a single case (a tag not named "a"), but I have heard this should be avoided in regular expressions, because it needs lookarounds and would perform badly.
I see no mistake in my regex, yet it does not match the given string. I want the following match:
<a=3>
Foo
<b>Bar</b>
</a>
Baz
Here is a link for experimenting with the regex: https://www.regex101.com/r/lO1wA6/1
Answer
You can use this regex to match what you want (the regex is placed in a string literal for convenience):
'~<a=5>(<([a-zA-Z0-9]+)[^>]*>(?1)*</\2>|[^<>]++)*</a>~'
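As a quick check, the sketch below runs both the question's pattern and this one against the sample input. A likely explanation for the original failure, traced by hand rather than stated in the original post: (?R) recurses into the entire pattern, which begins with the literal <a=5>, so the recursion can never consume the inner <a=3> tag, and none of the other alternatives accept <a followed by =. The variable names are illustrative only.

```php
<?php
// Sample input from the question.
$subject = <<<'TXT'
<a=5>
<a=3>
Foo
<b>Bar</b>
</a>
Baz
</a>
TXT;

// Pattern from the question: (?R) recurses into the WHOLE pattern,
// which itself starts with the literal <a=5>.
$questionPattern = '/<a=5>((?R)|[^<]|<\/?[^a]|<\/?a[a-zA-Z0-9-])*<\/a>/s';

// Pattern from the answer: (?1) recurses into capturing group 1 only.
$answerPattern = '~<a=5>(<([a-zA-Z0-9]+)[^>]*>(?1)*</\2>|[^<>]++)*</a>~';

$questionResult = preg_match($questionPattern, $subject);
$answerResult   = preg_match($answerPattern, $subject, $m);

var_dump($questionResult);      // int(0): the inner <a=3> cannot be consumed
var_dump($answerResult);        // int(1)
var_dump($m[0] === $subject);   // bool(true): the whole outer element matched
```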
Here is a breakdown of the regex above:
<a=5>
(
<([a-zA-Z0-9]+)[^>]*>
(?1)*
</\2>
|
[^<>]++
)*
</a>
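The layout above can also be used directly as a pattern. The following is a minimal sketch, assuming the x (extended) modifier, under which PCRE ignores whitespace outside character classes and treats # up to the end of the line as a comment. The comments and variable names are additions for illustration, not part of the original answer.

```php
<?php
// Free-spacing form of the answer's regex (x modifier).
$verbose = <<<'REGEX'
~
<a=5>                        # literal outermost opening tag
(                            # group 1: one child node
    <([a-zA-Z0-9]+)[^>]*>    #   opening tag; group 2 captures its name
    (?1)*                    #   children: call group 1 as a subroutine
    </\2>                    #   closing tag with the same name
  |
    [^<>]++                  #   or a run of non-tag text (possessive)
)*                           # zero or more child nodes
</a>                         # literal outermost closing tag
~x
REGEX;

// One-line form, exactly as given in the answer.
$compact = '~<a=5>(<([a-zA-Z0-9]+)[^>]*>(?1)*</\2>|[^<>]++)*</a>~';

$subject = "<a=5>\n<a=3>\nFoo\n<b>Bar</b>\n</a>\nBaz\n</a>";

$verboseResult = preg_match($verbose, $subject, $verboseMatch);
$compactResult = preg_match($compact, $subject, $compactMatch);

// Both forms should produce identical match arrays.
var_dump($verboseResult === 1 && $verboseMatch === $compactMatch);
```

The # comments must not contain the delimiter character (~ here), because PHP looks for the closing delimiter before PCRE ever sees the comments.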
The first part, <([a-zA-Z0-9]+)[^>]*>(?1)*</\2>, matches a pair of matching tags and all of their content. It assumes that a tag name consists only of the characters [a-zA-Z0-9]. The tag name is captured by ([a-zA-Z0-9]+) and backreferenced when matching the closing tag </\2>.
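The effect of the backreference can be seen with two small inputs, sketched below. The test strings are made up for illustration. The second one closes <b> with </c>, so </\2> cannot match and the overall match fails.

```php
<?php
$pattern = '~<a=5>(<([a-zA-Z0-9]+)[^>]*>(?1)*</\2>|[^<>]++)*</a>~';

$balanced   = '<a=5><b>x</b></a>';   // <b> is closed by </b>
$mismatched = '<a=5><b>x</c></a>';   // <b> is "closed" by </c>

$balancedResult   = preg_match($pattern, $balanced);
$mismatchedResult = preg_match($pattern, $mismatched);

var_dump($balancedResult);    // int(1)
var_dump($mismatchedResult);  // int(0): </\2> requires </b> here
```

This relies on PCRE restoring capture groups to their previous values when a subroutine call returns, so \2 refers to the tag name captured at the current nesting level.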
The second part, [^<>]++, matches anything else outside the tags. Note that quoted strings are not handled, so depending on your input it may not work.
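To make that limitation concrete, here is a small sketch with a hypothetical attribute value containing a > character. Since [^>]* stops at the first >, the tag is cut short and the match fails.

```php
<?php
$pattern = '~<a=5>(<([a-zA-Z0-9]+)[^>]*>(?1)*</\2>|[^<>]++)*</a>~';

$plain  = '<a=5><b title="xy">Bar</b></a>';
$quoted = '<a=5><b title="x>y">Bar</b></a>';  // ">" inside the quoted value

$plainResult  = preg_match($pattern, $plain);
$quotedResult = preg_match($pattern, $quoted);

var_dump($plainResult);   // int(1)
var_dump($quotedResult);  // int(0): the quoted ">" ends the tag too early
```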
Now back to the subroutine call (?1), which recursively calls the first capturing group. Notice that a tag can contain zero or more instances of other tags or non-tag content. Because of the way the regex is written, this property is also shared by the outermost <a=5>...</a> pair.
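The regex as given matches the whole outer element, while the question asks for its content. One possible variation, not part of the original answer, wraps the repeated group in an extra capturing group. The sketch below uses named groups (content, node and name are names chosen here for illustration) so that the subroutine call and the backreference do not depend on group numbers.

```php
<?php
$subject = "<a=5>\n<a=3>\nFoo\n<b>Bar</b>\n</a>\nBaz\n</a>";

// "content" wraps the repeated node so the inner text is captured as a whole.
// (?&node) calls the node group as a subroutine; \k<name> is the backreference.
$pattern = '~<a=5>(?<content>(?<node><(?<name>[a-zA-Z0-9]+)[^>]*>(?&node)*</\k<name>>|[^<>]++)*)</a>~';

$matched = preg_match($pattern, $subject, $m);

// trim() drops the newlines right after <a=5> and right before the final </a>.
$content = $matched === 1 ? trim($m['content']) : null;

echo $content, "\n";
```

With the sample input this should print the inner <a=3>...</a> element followed by Baz, which is the match requested in the question.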