Regex Recursion: Nth Subpatterns


Question

I'm trying to learn about recursion in regular expressions, and I have a basic understanding of the concepts in the PCRE flavour. I want to break the string:

Geese (Flock) Dogs (Pack) 

into:

Full Match: Geese (Flock) Dogs (Pack) 
Group 1: Geese (Flock)
Group 2: Geese
Group 3: (Flock)
Group 4: Dogs (Pack)
Group 5: Dogs
Group 6: (Pack)

I know neither regex quite does this, but I'm more curious why the first pattern works and the second one doesn't.

Pattern 1: ((.*?)(\(\w{1,}\)))((.*?)(\g<3>))*
Pattern 2: ((.*?)(\(\w{1,}\)))((\g<2>)(\g<3>))*

Also, if you're dealing with a long string in which a pattern repeats itself, is it possible to keep extending the full match and incrementally add groups, without writing a loop separate from the regex?

Full Match: Geese (Flock) Dogs (Pack) Elephants (Herd) 
Group 1: Geese (Flock)
Group 2: Geese
Group 3: (Flock)
Group 4: Dogs (Pack)
Group 5: Dogs
Group 6: (Pack)
Group 7: Elephants (Herd)
Group 8: Elephants 
Group 9: (Herd)

The closest I've come is this pattern, but the middle group, Dogs (Pack), becomes Group 0.

((.*?)(\(\w{1,}\)))((.*?)(\g<3>))*

Recommended Answer

Mind that recursion levels in PCRE are atomic. Once a recursed subpattern has found a match, it is never retried.

See Recursion and Subroutine Calls May or May Not Be Atomic:

Perl and Ruby backtrack into recursion if the remainder of the regex after the recursion fails. They try all permutations of the recursion as needed to allow the remainder of the regex to match. PCRE treats recursion as atomic. PCRE backtracks normally during the recursion, but once the recursion has matched, it does not try any further permutations of the recursion, even when the remainder of the regex fails to match. The result is that Perl and Ruby may find regex matches that PCRE cannot find, or that Perl and Ruby may find different regex matches.

Your second pattern, at the first recursion level, effectively looks like this:

((.*?)(\(\w{1,}\)))(((?>.*?))((?>\(\w{1,}\))))*
                     ^^^^^^^  ^^^^^^^^^^^^^^

See demo. That is, \g<2> behaves as (?>.*?), not as .*?. After the ((.*?)(\(\w{1,}\))) part has matched Geese (Flock), the regex engine tries (?>.*?), sees that it is a lazy pattern that does not have to consume any characters, matches the empty string (and, because the group is atomic, never comes back to extend it), and then tries (?>\(\w{1,}\)). Since the character right after the closing ) of (Flock) is a space and not (, that attempt fails, the repeated group matches zero times, and the regex returns what it has consumed so far.
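The difference can be reproduced outside PCRE as a rough sketch. Python's built-in re module has no \g<n> subroutine calls, so the example below expands them by hand, and it assumes Python 3.11 or later for the atomic group syntax (?>...). The names TEXT, LOOSE and ATOMIC are illustrative and not from the original post.

```python
import re
import sys

TEXT = "Geese (Flock) Dogs (Pack) "

# Pattern 1 with \g<3> expanded by hand: the lazy .*? is free to grow.
LOOSE = r"((.*?)(\(\w{1,}\)))((.*?)(\(\w{1,}\)))*"
# Pattern 2 as PCRE effectively treats it: both calls wrapped in atomic groups.
ATOMIC = r"((.*?)(\(\w{1,}\)))(((?>.*?))((?>\(\w{1,}\))))*"

loose_match = re.match(LOOSE, TEXT).group(0)

# Atomic groups were added to Python's re module in version 3.11.
if sys.version_info >= (3, 11):
    atomic_match = re.match(ATOMIC, TEXT).group(0)
else:
    atomic_match = None  # cannot be demonstrated on older interpreters

print(repr(loose_match))   # 'Geese (Flock) Dogs (Pack)'
print(repr(atomic_match))  # 'Geese (Flock)' on Python 3.11+
```

In the atomic version the empty match of (?>.*?) is locked in, so the repeated group never gets a chance to reach Dogs (Pack).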

As for the second question, it is a common problem. It is not possible to get an arbitrary number of captures with a PCRE regex: when a capturing group is repeated, only the last captured value is stored in the group buffer. You cannot have more submatches in the resulting array than there are capturing groups in the regex pattern. See Repeating a Capturing Group vs. Capturing a Repeated Group for more details.
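A minimal sketch of that limitation, and of the usual loop-based workaround, is shown here with Python's re module rather than PCRE (it overwrites repeated groups in the same way). The simplified item pattern using \w+ and the names TEXT, ITEM, last_only and all_items are assumptions made for illustration.

```python
import re

TEXT = "Geese (Flock) Dogs (Pack) Elephants (Herd) "

# One repeated group: every iteration overwrites groups 1 to 3,
# so only the values from the final iteration survive.
repeated = re.match(r"(?:\s*((\w+)\s*(\(\w+\))))+", TEXT)
last_only = repeated.groups()

# Matching one item at a time and looping keeps every item.
ITEM = re.compile(r"((\w+)\s*(\(\w+\)))")
all_items = [m.groups() for m in ITEM.finditer(TEXT)]

print(last_only)
# ('Elephants (Herd)', 'Elephants', '(Herd)')
for item in all_items:
    print(item)
# ('Geese (Flock)', 'Geese', '(Flock)')
# ('Dogs (Pack)', 'Dogs', '(Pack)')
# ('Elephants (Herd)', 'Elephants', '(Herd)')
```

The loop lives outside the regex, which is exactly what the question hoped to avoid, but it is the straightforward way to collect a number of items that is not known in advance.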
