“现代"的识别力正则表达式 [英] The recognizing power of "modern" regexes

查看:30
本文介绍了“现代"的识别力正则表达式的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

真正的现代正则表达式能识别哪类语言?

What class of languages do real modern regexes actually recognise?

每当存在具有反向引用的无限长度捕获组(例如 (.*)_1)时,正则表达式就会匹配非常规语言.但这本身并不足以匹配 S ::= '(' S ')' | 之类的东西.ε — 匹配括号对的上下文无关语言.

Whenever there is an unbounded length capturing group with a back-reference (e.g. (.*)_1) a regex is now matching a non-regular language. But this, on its own, isn't enough to match something like S ::= '(' S ')' | ε — the context-free language of matching pairs of parens.

递归正则表达式(对我来说是新的,但我确信存在于 Perl 和 PCRE 中)似乎至少可以识别大多数 CFL.

Recursive regexes (which are new to me, but I am assured exist in Perl and PCRE) appear to recognize at least most CFLs.

有没有人做过或阅读过这方面的研究?这些现代"正则表达式的局限性是什么?他们对 LL 或 LR 文法的 CFG 是否严格多于或严格少于 CFG?或者是否存在两种可以被正则表达式识别但不能被 CFG 识别的语言相反?

Has anyone done or read any research in this area? What are the limitations of these "modern" regexes? Do they recognize strictly more or strictly less than CFGs, of LL or LR grammars? Or do there exist both languages that can be recognized by a regex but not a CFG and the opposite?

相关论文的链接将不胜感激.

Links to relevant papers would be much appreciated.

推荐答案

模式递归

使用递归模式,您有一种递归下降匹配的形式.

这对于各种问题都很好,但是一旦你想要真正做递归下降解析,你需要在这里和那里插入捕获组,恢复完整的解析结构很尴尬通过这种方式.Damian Conway 的 Regexp::Grammars 模块用于Perl 将简单模式转换为等效模式,自动将所有命名捕获转换为递归数据结构,从而更容易检索已解析的结构.在这篇文章的末尾,我有一个比较这两种方法的示例.

This is fine for a variety of problems, but once you want to actually do recursive descent parsing, you need to insert capture groups here and there, and it is awkward to recover the full parse structure in this way. Damian Conway’s Regexp::Grammars module for Perl transforms the simple pattern into an equivalent one that automatically does all that named capturing into a recursive data structure, making for far easier retrieval of the parsed structure. I have a sample comparing these two approaches at end of this posting.

问题是递归模式可以匹配什么样的语法.嗯,它们肯定是递归下降类型匹配器.唯一想到的是递归模式无法处理左递归.strong> 这限制了您可以应用它们的语法类型.有时您可以重新排序您的作品以消除左递归.

The question was what kinds of grammars that recursive patterns can match. Well, they’re certainly recursive descent type matchers. The only thing that comes to mind is that recursive patterns cannot handle left recursion. This puts a constraint on the sorts of grammars that you can apply them to. Sometimes you can reorder your productions to eliminate left recursion.

顺便说一句,PCRE 和 Perl 在如何表述递归方面略有不同.请参阅 pcrepattern 联机帮助页中有关递归模式"和与 Perl 的递归差异"的部分.例如:Perl 可以处理 ^(.|(.)(?1)2)$ 其中 PCRE 需要 ^((.)(?1)2|.)$代替.

BTW, PCRE and Perl differ slightly on how you’re allowed to phrase the recursion. See the sections on "RECURSIVE PATTERNS" and "Recursion difference from Perl" in the pcrepattern manpage. eg: Perl can handle ^(.|(.)(?1)2)$ where PCRE requires ^((.)(?1)2|.)$ instead.

对递归模式的需求出乎意料地频繁出现.一个经常访问的示例是当您需要匹配可以嵌套的内容时,例如平衡括号、引号,甚至 HTML/XML 标签.这是平衡括号的匹配:

The need for recursive patterns arises surprisingly frequently. One well-visited example is when you need to match something that can nest, such as balanced parentheses, quotes, or even HTML/XML tags. Here’s the match for balenced parens:

((?:[^()]*+|(?0))*)

我觉得阅读起来比较棘手,因为它的紧凑性.这可以通过 /x 模式轻松解决,使空白不再重要:

I find that trickier to read because of its compact nature. This is easily curable with /x mode to make whitespace no longer significant:

( (?: [^()] *+ | (?0) )* )

再说一次,因为我们在递归中使用括号,一个更清晰的例子是匹配嵌套的单引号:

Then again, since we’re using parens for our recursion, a clearer example would be matching nested single quotes:

‘ (?: [^‘’] *+ | (?0) )* ’

您可能希望匹配的另一个递归定义的东西是回文.这个简单的模式适用于 Perl:

Another recursively defined thing you may wish to match would be a palindrome. This simple pattern works in Perl:

^((.)(?1)2|.?)$

您可以使用以下方法在大多数系统上进行测试:

which you can test on most systems using something like this:

$ perl -nle 'print if /^((.)(?1)2|.?)$/i' /usr/share/dict/words

注意 PCRE 的递归实现需要更精细的

Note that PCRE’s implementation of recursion requires the more elaborate

^(?:((.)(?1)2|)|((.)(?3)4|.))

这是因为 PCRE 递归的工作方式受到限制.

This is because of restrictions on how PCRE recursion works.

对我来说,上面的例子主要是玩具火柴,并不是所有有趣的,真的.当它变得有趣时,就是当您尝试解析真正的语法时.例如,RFC 5322 相当详细地定义了一个邮件地址.这是与之匹配的语法"模式:

To me, the examples above are mostly toy matches, not all that interesting, really. When it becomes interesting is when you have a real grammar you’re trying to parse. For example, RFC 5322 defines a mail address rather elaborately. Here’s a "grammatical" pattern to match it:

$rfc5322 = qr{

   (?(DEFINE)

     (?<address>         (?&mailbox) | (?&group))
     (?<mailbox>         (?&name_addr) | (?&addr_spec))
     (?<name_addr>       (?&display_name)? (?&angle_addr))
     (?<angle_addr>      (?&CFWS)? < (?&addr_spec) > (?&CFWS)?)
     (?<group>           (?&display_name) : (?:(?&mailbox_list) | (?&CFWS))? ; (?&CFWS)?)
     (?<display_name>    (?&phrase))
     (?<mailbox_list>    (?&mailbox) (?: , (?&mailbox))*)

     (?<addr_spec>       (?&local_part) @ (?&domain))
     (?<local_part>      (?&dot_atom) | (?&quoted_string))
     (?<domain>          (?&dot_atom) | (?&domain_literal))
     (?<domain_literal>  (?&CFWS)? [ (?: (?&FWS)? (?&dcontent))* (?&FWS)?
                                   ] (?&CFWS)?)
     (?<dcontent>        (?&dtext) | (?&quoted_pair))
     (?<dtext>           (?&NO_WS_CTL) | [x21-x5ax5e-x7e])

     (?<atext>           (?&ALPHA) | (?&DIGIT) | [!#$%&'*+-/=?^_`{|}~])
     (?<atom>            (?&CFWS)? (?&atext)+ (?&CFWS)?)
     (?<dot_atom>        (?&CFWS)? (?&dot_atom_text) (?&CFWS)?)
     (?<dot_atom_text>   (?&atext)+ (?: . (?&atext)+)*)

     (?<text>            [x01-x09x0bx0cx0e-x7f])
     (?<quoted_pair>     \ (?&text))

     (?<qtext>           (?&NO_WS_CTL) | [x21x23-x5bx5d-x7e])
     (?<qcontent>        (?&qtext) | (?&quoted_pair))
     (?<quoted_string>   (?&CFWS)? (?&DQUOTE) (?:(?&FWS)? (?&qcontent))*
                          (?&FWS)? (?&DQUOTE) (?&CFWS)?)

     (?<word>            (?&atom) | (?&quoted_string))
     (?<phrase>          (?&word)+)

     # Folding white space
     (?<FWS>             (?: (?&WSP)* (?&CRLF))? (?&WSP)+)
     (?<ctext>           (?&NO_WS_CTL) | [x21-x27x2a-x5bx5d-x7e])
     (?<ccontent>        (?&ctext) | (?&quoted_pair) | (?&comment))
     (?<comment>         ( (?: (?&FWS)? (?&ccontent))* (?&FWS)? ) )
     (?<CFWS>            (?: (?&FWS)? (?&comment))*
                         (?: (?:(?&FWS)? (?&comment)) | (?&FWS)))

     # No whitespace control
     (?<NO_WS_CTL>       [x01-x08x0bx0cx0e-x1fx7f])

     (?<ALPHA>           [A-Za-z])
     (?<DIGIT>           [0-9])
     (?<CRLF>            x0d x0a)
     (?<DQUOTE>          ")
     (?<WSP>             [x20x09])
   )

   (?&address)

}x;

如您所见,这非常类似于 BNF.问题是它只是匹配,而不是捕获.而且你真的不想只用捕获括号来包围整个事情,因为这不会告诉你哪个制作与哪个部分匹配.使用前面提到的 Regexp::Grammars 模块,我们可以.

As you see, that’s very BNF-like. The problem is it is just a match, not a capture. And you really don’t want to just surround the whole thing with capturing parens because that doesn’t tell you which production matched which part. Using the previously mentioned Regexp::Grammars module, we can.

#!/usr/bin/env perl

use strict;
use warnings;
use 5.010;
use Data::Dumper "Dumper";

my $rfc5322 = do {
    use Regexp::Grammars;    # ...the magic is lexically scoped
    qr{

    # Keep the big stick handy, just in case...
    # <debug:on>

    # Match this...
    <address>

    # As defined by these...
    <token: address>         <mailbox> | <group>
    <token: mailbox>         <name_addr> | <addr_spec>
    <token: name_addr>       <display_name>? <angle_addr>
    <token: angle_addr>      <CFWS>? < <addr_spec> > <CFWS>?
    <token: group>           <display_name> : (?:<mailbox_list> | <CFWS>)? ; <CFWS>?
    <token: display_name>    <phrase>
    <token: mailbox_list>    <[mailbox]> ** (,)

    <token: addr_spec>       <local_part> @ <domain>
    <token: local_part>      <dot_atom> | <quoted_string>
    <token: domain>          <dot_atom> | <domain_literal>
    <token: domain_literal>  <CFWS>? [ (?: <FWS>? <[dcontent]>)* <FWS>?

    <token: dcontent>        <dtext> | <quoted_pair>
    <token: dtext>           <.NO_WS_CTL> | [x21-x5ax5e-x7e]

    <token: atext>           <.ALPHA> | <.DIGIT> | [!#$%&'*+-/=?^_`{|}~]
    <token: atom>            <.CFWS>? <.atext>+ <.CFWS>?
    <token: dot_atom>        <.CFWS>? <.dot_atom_text> <.CFWS>?
    <token: dot_atom_text>   <.atext>+ (?: . <.atext>+)*

    <token: text>            [x01-x09x0bx0cx0e-x7f]
    <token: quoted_pair>     \ <.text>

    <token: qtext>           <.NO_WS_CTL> | [x21x23-x5bx5d-x7e]
    <token: qcontent>        <.qtext> | <.quoted_pair>
    <token: quoted_string>   <.CFWS>? <.DQUOTE> (?:<.FWS>? <.qcontent>)*
                             <.FWS>? <.DQUOTE> <.CFWS>?

    <token: word>            <.atom> | <.quoted_string>
    <token: phrase>          <.word>+

    # Folding white space
    <token: FWS>             (?: <.WSP>* <.CRLF>)? <.WSP>+
    <token: ctext>           <.NO_WS_CTL> | [x21-x27x2a-x5bx5d-x7e]
    <token: ccontent>        <.ctext> | <.quoted_pair> | <.comment>
    <token: comment>         ( (?: <.FWS>? <.ccontent>)* <.FWS>? )
    <token: CFWS>            (?: <.FWS>? <.comment>)*
                             (?: (?:<.FWS>? <.comment>) | <.FWS>)

    # No whitespace control
    <token: NO_WS_CTL>       [x01-x08x0bx0cx0e-x1fx7f]
    <token: ALPHA>           [A-Za-z]
    <token: DIGIT>           [0-9]
    <token: CRLF>            x0d x0a
    <token: DQUOTE>          "
    <token: WSP>             [x20x09]
    }x;
};

while (my $input = <>) {
    if ($input =~ $rfc5322) {
        say Dumper \%/;       # ...the parse tree of any successful match
                              # appears in this punctuation variable
    }
}

如您所见,通过在模式中使用稍微不同的符号,您现在可以在 %/ 变量中为您存储整个解析树,并且所有内容都被整齐地标记.转换的结果仍然是一个模式,正如您可以通过 =~ 运算符看到的那样.只是有点神奇.

As you see, by using a very slightly different notation in the pattern, you now get something which stores the entire parse tree away for you in the %/ variable, with everything neatly labelled. The result of the transformation is still a pattern, as you can see by the =~ operator. It’s just a bit magical.

这篇关于“现代"的识别力正则表达式的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆