Can I use Perl regular expressions to match balanced text?

Question

I would like to match text enclosed in brackets etc in Perl. How can I do that?

This is a question from the official perlfaq. We're importing the perlfaq to Stack Overflow.

Recommended answer

This is the official FAQ answer minus any subsequent edits.

Your first try should probably be the Text::Balanced module, which has been in the Perl standard library since Perl 5.8. It has a variety of functions to deal with tricky text. The Regexp::Common module (available from CPAN) can also help by providing canned patterns you can use.
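As a rough illustration of the Text::Balanced route, here is a minimal sketch that pulls the top-level angle-bracket groups out of a string with extract_bracketed. The variable names and the '[^<]*' prefix pattern (which skips any text before the next opening bracket) are choices made for this example, not part of the FAQ answer.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

my $text = "I have some <brackets in <nested brackets> > and\n"
         . "<another group <nested once <nested twice> > >\n"
         . "and that's it.\n";

my @found;
my $remainder = $text;
while (1) {
    # '<>' names the delimiters to balance; '[^<]*' is a prefix pattern
    # that skips everything before the next opening angle bracket.
    my ($group, $rest) = extract_bracketed($remainder, '<>', '[^<]*');

    # Stop as soon as no further balanced group can be extracted.
    last unless defined $group && length $group;

    push @found, $group;
    $remainder = $rest;
}

print "$_\n" for @found;
```

In list context the function leaves its input untouched and returns the extracted text plus the remainder, which is why the loop feeds the remainder back in on each pass.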

As of Perl 5.10, you can match balanced text with regular expressions using recursive patterns. Before Perl 5.10, you had to resort to various tricks such as using Perl code in (??{}) sequences.
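For comparison, the older (??{}) trick might look something like the sketch below, adapted from the parenthesis-matching example in perlre. The variable name $balanced and the sample string are made up for this illustration. The pattern refers to itself through a variable that is only evaluated at match time, and it uses an atomic group (?>...) because possessive quantifiers did not exist before Perl 5.10.

```perl
use strict;
use warnings;

# Declared before the assignment so the pattern can refer to itself.
our $balanced;
$balanced = qr{
    <
    (?:
        (?> [^<>]+ )        # atomic group: non-bracket text, no backtracking
      |
        (??{ $balanced })   # evaluated at match time: match a nested group
    )*
    >
}x;

my $text   = "I have some <brackets in <nested brackets> > and <more <text> >";
my @groups = $text =~ /($balanced)/g;

print "$_\n" for @groups;
```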

Here's an example using a recursive regular expression. The goal is to capture all of the text within angle brackets, including the text in nested angle brackets. This sample text has two "major" groups: a group with one level of nesting and a group with two levels of nesting. There are five total groups in angle brackets:

I have some <brackets in <nested brackets> > and
<another group <nested once <nested twice> > >
and that's it.

The regular expression to match the balanced text uses two new (to Perl 5.10) regular expression features. These are covered in perlre and this example is a modified version of one in that documentation.

First, adding the new possessive + to any quantifier makes it take the longest match it can and never backtrack into it. That's important since you want to handle any angle brackets through the recursion, not backtracking. The group [^<>]++ matches one or more characters that are not angle brackets, without backtracking.
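A tiny sketch of the difference between a greedy and a possessive quantifier may help here. The string and variable names are invented for the illustration.

```perl
use strict;
use warnings;

my $string = "aaa";

# Greedy: a+ first takes all three characters, then gives one back
# so that the final literal "a" can match.
my $greedy = ($string =~ /^a+a$/) ? "match" : "no match";

# Possessive: a++ takes all three characters and never gives any back,
# so nothing is left for the final "a" and the match fails.
my $possessive = ($string =~ /^a++a$/) ? "match" : "no match";

print "greedy: $greedy\n";
print "possessive: $possessive\n";
```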

Second, the new (?PARNO) construct refers to the sub-pattern in the particular capture group given by PARNO. In the following regex, the first capture group finds (and remembers) the balanced text, and you need that same pattern within the first capture group to get past the nested text. That's the recursive part. The (?1) uses the pattern in the outer capture group as an independent part of the regex.
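To see how (?1) differs from an ordinary backreference, consider this small sketch (the sample string and variable names are invented for the illustration). (?1) re-runs the pattern of group 1, while \1 demands the exact text that group 1 already matched.

```perl
use strict;
use warnings;

my $string = "12-345";

# (?1) re-applies the pattern \d+, so it is free to match "345".
my $recursed = ($string =~ /^(\d+)-(?1)$/) ? 1 : 0;

# \1 must match the same text as group 1 ("12"), so this fails on "345".
my $backref = ($string =~ /^(\d+)-\1$/) ? 1 : 0;

# After the match, group 1 holds what the outer group matched,
# not what the recursive call matched.
my ($first) = $string =~ /^(\d+)-(?1)$/;

print "recursed: $recursed, backref: $backref, group 1: $first\n";
```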

Putting it all together, you have:

#!/usr/local/bin/perl5.10.0

my $string =<<"HERE";
I have some <brackets in <nested brackets> > and
<another group <nested once <nested twice> > >
and that's it.
HERE

my @groups = $string =~ m/
        (                   # start of capture group 1
        <                   # match an opening angle bracket
            (?:
                [^<>]++     # one or more non angle brackets, non backtracking
                  |
                (?1)        # otherwise recurse to capture group 1 for a nested <...>
            )*
        >                   # match a closing angle bracket
        )                   # end of capture group 1
        /xg;

$" = "\n\t";
print "Found:\n\t@groups\n";

The output shows that Perl found the two major groups:

Found:
    <brackets in <nested brackets> >
    <another group <nested once <nested twice> > >

With a little extra work, you can get all of the groups in angle brackets, even those nested inside other angle brackets. Each time you get a balanced match, remove its outer delimiters (that's the pair you just matched, so you don't match it again) and add the result to a queue of strings to process. Keep doing that until you get no matches:

#!/usr/local/bin/perl5.10.0

my @queue =<<"HERE";
I have some <brackets in <nested brackets> > and
<another group <nested once <nested twice> > >
and that's it.
HERE

my $regex = qr/
        (                   # start of bracket 1
        <                   # match an opening angle bracket
            (?:
                [^<>]++     # one or more non angle brackets, non backtracking
                  |
                (?1)        # recurse to bracket 1
            )*
        >                   # match a closing angle bracket
        )                   # end of bracket 1
        /x;

$" = "\n\t";

while( @queue )
    {
    my $string = shift @queue;

    my @groups = $string =~ m/$regex/g;
    print "Found:\n\t@groups\n\n" if @groups;

    unshift @queue, map { s/^<//; s/>$//; $_ } @groups;
    }

The output shows all of the groups. The outermost matches show up first and the nested matches show up later:

Found:
    <brackets in <nested brackets> >
    <another group <nested once <nested twice> > >

Found:
    <nested brackets>

Found:
    <nested once <nested twice> >

Found:
    <nested twice>
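A different technique from the queue used in the FAQ answer is to wrap the recursive group in a lookahead. This is only a sketch: because the lookahead consumes no text, the /g match is retried at every position, so each opening bracket gets its own chance to start a balanced group. The groups then come back in the order in which their opening brackets appear in the string.

```perl
use strict;
use warnings;

my $text = <<'HERE';
I have some <brackets in <nested brackets> > and
<another group <nested once <nested twice> > >
and that's it.
HERE

my @groups = $text =~ m/
    (?=                     # look ahead without consuming any text
        (                   # start of capture group 1
        <                   # match an opening angle bracket
            (?:
                [^<>]++     # one or more non angle brackets, non backtracking
              |
                (?1)        # recurse to capture group 1
            )*
        >                   # match a closing angle bracket
        )                   # end of capture group 1
    )
    /xg;

print "$_\n" for @groups;
```

On the sample text this finds all five groups in a single pass, at the cost of rescanning nested text once per nesting level.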
