Can I use Perl regular expressions to match balanced text?


Question

I would like to match text enclosed in brackets and similar delimiters in Perl. How can I do that?

This is a question from the official perlfaq. We're importing the perlfaq to Stack Overflow.

Recommended answer

This is the official FAQ answer minus any subsequent edits.

Your first try should probably be the Text::Balanced module, which has been in the Perl standard library since Perl 5.8. It has a variety of functions to deal with tricky text. The Regexp::Common module can also help by providing canned patterns you can use.
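As a minimal sketch of the Text::Balanced route (not part of the FAQ answer), the following uses extract_bracketed with '<>' as the delimiter pair. The sample string and variable names are made up for illustration, and the third argument is a prefix pattern that skips the text before the first opening bracket.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Illustrative input: one balanced group with a nested group inside it.
my $text = "I have some <brackets in <nested brackets> > and more";

# In list context, extract_bracketed returns the extracted substring,
# the remainder of the text, and the skipped prefix.
# The prefix pattern [^<]* skips everything before the first '<'.
my ($extracted, $remainder) = extract_bracketed($text, '<>', qr/[^<]*/);

print "Extracted: $extracted\n" if defined $extracted;
```

If no balanced group is found, the extracted value is undefined, so check it before use.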

As of Perl 5.10, you can match balanced text with regular expressions using recursive patterns. Before Perl 5.10, you had to resort to various tricks such as using Perl code in (??{}) sequences.
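For comparison, here is a rough sketch of the older (??{}) approach, adapted from the balanced-parentheses example in perlre to angle brackets. The variable name $balanced and the sample string are assumptions for illustration. The pattern refers to itself through a code block that is evaluated at match time.

```perl
use strict;
use warnings;

# Declare the variable first so the pattern can refer to itself.
my $balanced;
$balanced = qr{
    <
    (?:
        (?> [^<>]+ )          # a run of non-brackets, without backtracking
      |
        (??{ $balanced })     # evaluated at match time: a nested group
    )*
    >
}x;

my $text = "I have some <brackets in <nested brackets> > and that's it.";

# Match against the compiled pattern directly and read the whole match.
my $match;
$match = $& if $text =~ $balanced;

print "Found: $match\n" if defined $match;
```

The recursive (?1) form is usually easier to read, so this version is mainly of interest for code that must run on perls older than 5.10.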

Here's an example using a recursive regular expression. The goal is to capture all of the text within angle brackets, including the text in nested angle brackets. This sample text has two "major" groups: a group with one level of nesting and a group with two levels of nesting. There are five total groups in angle brackets:

I have some <brackets in <nested brackets> > and
<another group <nested once <nested twice> > >
and that's it.

The regular expression to match the balanced text uses two regular expression features that were new in Perl 5.10. Both are covered in perlre, and this example is a modified version of one in that documentation.

First, adding the new possessive + to any quantifier finds the longest match and does not backtrack. That's important since you want to handle any angle brackets through the recursion, not backtracking. The group [^<>]++ finds one or more non-angle brackets without backtracking.
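A small sketch can make the difference between a greedy and a possessive quantifier concrete. The string and variable names below are illustrative only. The greedy a+ gives back one character so the final a can match, while the possessive a++ keeps everything it consumed and the match fails.

```perl
use strict;
use warnings;

my $s = "aaa";

# Greedy: a+ first takes all three characters, then backtracks by one
# so that the trailing 'a' in the pattern can match.
my $greedy = ($s =~ /^a+a$/) ? 1 : 0;

# Possessive: a++ takes all three characters and never gives any back,
# so nothing is left for the trailing 'a' and the match fails.
my $possessive = ($s =~ /^a++a$/) ? 1 : 0;

print "greedy matched: $greedy, possessive matched: $possessive\n";
```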

Second, the new (?PARNO) refers to the sub-pattern in the particular capture group given by PARNO. In the following regex, the first capture group finds (and remembers) the balanced text, and you need that same pattern within the first group to get past the nested text. That's the recursive part. The (?1) uses the pattern in the outer capture group as an independent part of the regex.
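To illustrate that (?1) reuses the pattern of group 1 rather than the text it captured, the following sketch contrasts it with the backreference \1. The date-like string is a made-up example.

```perl
use strict;
use warnings;

my $s = "2024-05";

# (?1) re-runs the sub-pattern \d+ from group 1, so any digits will do
# on the right-hand side of the hyphen.
my $with_recursion = ($s =~ /^(\d+)-(?1)$/) ? 1 : 0;

# \1 demands the same text that group 1 captured ("2024"), so this fails.
my $with_backref = ($s =~ /^(\d+)-\1$/) ? 1 : 0;

print "(?1) matched: $with_recursion, \\1 matched: $with_backref\n";
```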

Putting it all together, you have:

#!/usr/local/bin/perl5.10.0

my $string =<<"HERE";
I have some <brackets in <nested brackets> > and
<another group <nested once <nested twice> > >
and that's it.
HERE

my @groups = $string =~ m/
        (                   # start of capture group 1
        <                   # match an opening angle bracket
            (?:
                [^<>]++     # one or more non angle brackets, non backtracking
                  |
                (?1)        # found < or >, so recurse to capture group 1
            )*
        >                   # match a closing angle bracket
        )                   # end of capture group 1
        /xg;

$" = "\n\t";
print "Found:\n\t@groups\n";

The output shows that Perl found the two major groups:

Found:
    <brackets in <nested brackets> >
    <another group <nested once <nested twice> > >

With a little extra work, you can get all of the groups in angle brackets even if they are in other angle brackets too. Each time you get a balanced match, remove its outer delimiters (those are the ones you just matched, so don't match them again) and add the remaining string to a queue of strings to process. Keep doing that until you get no matches:

#!/usr/local/bin/perl5.10.0

my @queue =<<"HERE";
I have some <brackets in <nested brackets> > and
<another group <nested once <nested twice> > >
and that's it.
HERE

my $regex = qr/
        (                   # start of bracket 1
        <                   # match an opening angle bracket
            (?:
                [^<>]++     # one or more non angle brackets, non backtracking
                  |
                (?1)        # recurse to bracket 1
            )*
        >                   # match a closing angle bracket
        )                   # end of bracket 1
        /x;

$" = "\n\t";

while( @queue )
    {
    my $string = shift @queue;

    my @groups = $string =~ m/$regex/g;
    print "Found:\n\t@groups\n\n" if @groups;

    unshift @queue, map { s/^<//; s/>$//; $_ } @groups;
    }

The output shows all of the groups. The outermost matches show up first and the nested matches show up later:

Found:
    <brackets in <nested brackets> >
    <another group <nested once <nested twice> > >

Found:
    <nested brackets>

Found:
    <nested once <nested twice> >

Found:
    <nested twice>
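As an alternative to the queue (a sketch that is not part of the FAQ answer), the recursive group can be wrapped in a lookahead. Because the lookahead consumes no text, the /g match is retried at every position, so each opening bracket yields its own balanced group, nested or not. The groups come back in order of their starting position rather than by nesting level.

```perl
use strict;
use warnings;

my $string = <<"HERE";
I have some <brackets in <nested brackets> > and
<another group <nested once <nested twice> > >
and that's it.
HERE

my @groups = $string =~ m/
    (?=                     # lookahead, so no text is consumed
        (                   # start of capture group 1
        <                   # match an opening angle bracket
            (?:
                [^<>]++     # one or more non angle brackets, non backtracking
                  |
                (?1)        # recurse to capture group 1
            )*
        >                   # match a closing angle bracket
        )                   # end of capture group 1
    )
    /xg;

print scalar(@groups), " groups found\n";
print "\t$_\n" for @groups;
```

For the sample text this collects the five groups mentioned earlier in a single pass.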
