“现代"的识别能力被称为“现代".正则表达式 [英] The recognizing power of "modern" regexes

查看:94
本文介绍了“现代"的识别能力被称为“现代".正则表达式的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

真正的现代正则表达式实际上可以识别哪种语言?

What class of languages do real modern regexes actually recognise?

只要有无限长度的捕获组具有反向引用(例如(.*)_\1),则正则表达式现在就会与非常规语言匹配.但是,仅凭它本身还不足以匹配S ::= '(' S ')' | ε之类的东西-匹配配对对的无上下文语言.

Whenever there is an unbounded length capturing group with a back-reference (e.g. (.*)_\1) a regex is now matching a non-regular language. But this, on its own, isn't enough to match something like S ::= '(' S ')' | ε — the context-free language of matching pairs of parens.

递归正则表达式(对我来说是新手,但我确信Perl和PCRE中存在)可以识别至少大多数CFL.

Recursive regexes (which are new to me, but I am assured exist in Perl and PCRE) appear to recognize at least most CFLs.

有人在这方面做过或阅读过任何研究吗?这些现代"正则表达式的局限性是什么?他们对LL或LR语法的认识比CFG严格多或少吗?还是存在两种可以被正则表达式识别但CFG 不能识别的语言?

Has anyone done or read any research in this area? What are the limitations of these "modern" regexes? Do they recognize strictly more or strictly less than CFGs, of LL or LR grammars? Or do there exist both languages that can be recognized by a regex but not a CFG and the opposite?

与相关论文的链接将不胜感激.

Links to relevant papers would be much appreciated.

推荐答案

模式递归

对于递归模式,您具有递归下降形式 matching .

这对于各种问题都很好,但是一旦您真正想进行递归下降解析,您就需要在这里和那里插入捕获组,并且恢复完整的解析结构很尴尬.这样. Damian Conway的 Regexp :: Grammars 模块Perl将简单的模式转换为等效的模式,该模式会自动将所有名为捕获"的功能转换为递归数据结构,从而使解析结构的检索变得更加容易.在本文的结尾,我有一个示例比较了这两种方法.

This is fine for a variety of problems, but once you want to actually do recursive descent parsing, you need to insert capture groups here and there, and it is awkward to recover the full parse structure in this way. Damian Conway’s Regexp::Grammars module for Perl transforms the simple pattern into an equivalent one that automatically does all that named capturing into a recursive data structure, making for far easier retrieval of the parsed structure. I have a sample comparing these two approaches at end of this posting.

问题是递归模式可以匹配哪些语法.好吧,它们肯定是递归血统类型匹配器.唯一想到的是递归模式无法处理左递归.这对可以应用它们的语法施加了限制.有时,您可以对制作进行重新排序以消除左递归.

The question was what kinds of grammars that recursive patterns can match. Well, they’re certainly recursive descent type matchers. The only thing that comes to mind is that recursive patterns cannot handle left recursion. This puts a constraint on the sorts of grammars that you can apply them to. Sometimes you can reorder your productions to eliminate left recursion.

BTW,PCRE和Perl在允许您使用递归的方式上略有不同.请参见 pcrepattern 联机帮助页中的递归模式"和与Perl的递归差异"部分.例如:Perl可以处理^(.|(.)(?1)\2)$,而PCRE则需要^((.)(?1)\2|.)$.

BTW, PCRE and Perl differ slightly on how you’re allowed to phrase the recursion. See the sections on "RECURSIVE PATTERNS" and "Recursion difference from Perl" in the pcrepattern manpage. eg: Perl can handle ^(.|(.)(?1)\2)$ where PCRE requires ^((.)(?1)\2|.)$ instead.

对递归模式的需求令人惊讶地频繁出现.一个广为人知的例子是,当您需要匹配一些可以嵌套的内容时,例如平衡的括号,引号,甚至HTML/XML标签.这是平衡的父母的配对:

The need for recursive patterns arises surprisingly frequently. One well-visited example is when you need to match something that can nest, such as balanced parentheses, quotes, or even HTML/XML tags. Here’s the match for balenced parens:

\((?:[^()]*+|(?0))*\)

由于其紧凑的性质,我觉得阅读起来比较棘手.使用/x模式可以轻松解决此问题,以使空格不再有意义:

I find that trickier to read because of its compact nature. This is easily curable with /x mode to make whitespace no longer significant:

\( (?: [^()] *+ | (?0) )* \)

然后再一次,因为我们使用parens进行递归,所以一个更清晰的示例将匹配嵌套的单引号:

Then again, since we’re using parens for our recursion, a clearer example would be matching nested single quotes:

‘ (?: [^‘’] *+ | (?0) )* ’

您可能希望匹配的另一种递归定义的东西是回文.这种简单的模式在Perl中有效:

Another recursively defined thing you may wish to match would be a palindrome. This simple pattern works in Perl:

^((.)(?1)\2|.?)$

您可以使用以下方法在大多数系统上进行测试:

which you can test on most systems using something like this:

$ perl -nle 'print if /^((.)(?1)\2|.?)$/i' /usr/share/dict/words

请注意,PCRE实施递归需要更详尽的

Note that PCRE’s implementation of recursion requires the more elaborate

^(?:((.)(?1)\2|)|((.)(?3)\4|.))

这是因为PCRE递归的工作方式受到限制.

This is because of restrictions on how PCRE recursion works.

对我来说,上面的示例主要是玩具匹配,并不是所有真正有趣的 .当它变得有趣时,就是当您尝试解析一个真实的语法时.例如,RFC 5322相当详细地定义了一个邮件地址.这是与之匹配的语法"模式:

To me, the examples above are mostly toy matches, not all that interesting, really. When it becomes interesting is when you have a real grammar you’re trying to parse. For example, RFC 5322 defines a mail address rather elaborately. Here’s a "grammatical" pattern to match it:

$rfc5322 = qr{

   (?(DEFINE)

     (?<address>         (?&mailbox) | (?&group))
     (?<mailbox>         (?&name_addr) | (?&addr_spec))
     (?<name_addr>       (?&display_name)? (?&angle_addr))
     (?<angle_addr>      (?&CFWS)? < (?&addr_spec) > (?&CFWS)?)
     (?<group>           (?&display_name) : (?:(?&mailbox_list) | (?&CFWS))? ; (?&CFWS)?)
     (?<display_name>    (?&phrase))
     (?<mailbox_list>    (?&mailbox) (?: , (?&mailbox))*)

     (?<addr_spec>       (?&local_part) \@ (?&domain))
     (?<local_part>      (?&dot_atom) | (?&quoted_string))
     (?<domain>          (?&dot_atom) | (?&domain_literal))
     (?<domain_literal>  (?&CFWS)? \[ (?: (?&FWS)? (?&dcontent))* (?&FWS)?
                                   \] (?&CFWS)?)
     (?<dcontent>        (?&dtext) | (?&quoted_pair))
     (?<dtext>           (?&NO_WS_CTL) | [\x21-\x5a\x5e-\x7e])

     (?<atext>           (?&ALPHA) | (?&DIGIT) | [!#\$%&'*+-/=?^_`{|}~])
     (?<atom>            (?&CFWS)? (?&atext)+ (?&CFWS)?)
     (?<dot_atom>        (?&CFWS)? (?&dot_atom_text) (?&CFWS)?)
     (?<dot_atom_text>   (?&atext)+ (?: \. (?&atext)+)*)

     (?<text>            [\x01-\x09\x0b\x0c\x0e-\x7f])
     (?<quoted_pair>     \\ (?&text))

     (?<qtext>           (?&NO_WS_CTL) | [\x21\x23-\x5b\x5d-\x7e])
     (?<qcontent>        (?&qtext) | (?&quoted_pair))
     (?<quoted_string>   (?&CFWS)? (?&DQUOTE) (?:(?&FWS)? (?&qcontent))*
                          (?&FWS)? (?&DQUOTE) (?&CFWS)?)

     (?<word>            (?&atom) | (?&quoted_string))
     (?<phrase>          (?&word)+)

     # Folding white space
     (?<FWS>             (?: (?&WSP)* (?&CRLF))? (?&WSP)+)
     (?<ctext>           (?&NO_WS_CTL) | [\x21-\x27\x2a-\x5b\x5d-\x7e])
     (?<ccontent>        (?&ctext) | (?&quoted_pair) | (?&comment))
     (?<comment>         \( (?: (?&FWS)? (?&ccontent))* (?&FWS)? \) )
     (?<CFWS>            (?: (?&FWS)? (?&comment))*
                         (?: (?:(?&FWS)? (?&comment)) | (?&FWS)))

     # No whitespace control
     (?<NO_WS_CTL>       [\x01-\x08\x0b\x0c\x0e-\x1f\x7f])

     (?<ALPHA>           [A-Za-z])
     (?<DIGIT>           [0-9])
     (?<CRLF>            \x0d \x0a)
     (?<DQUOTE>          ")
     (?<WSP>             [\x20\x09])
   )

   (?&address)

}x;

如您所见,这非常类似于BNF.问题在于这只是一场比赛,而不是一场比赛.而且,您确实不希望只用捕获paren来包围整个事情,因为这并不能告诉您哪个产品与哪个部分匹配.使用前面提到的Regexp :: Grammars模块,我们可以.

As you see, that’s very BNF-like. The problem is it is just a match, not a capture. And you really don’t want to just surround the whole thing with capturing parens because that doesn’t tell you which production matched which part. Using the previously mentioned Regexp::Grammars module, we can.

#!/usr/bin/env perl

use strict;
use warnings;
use 5.010;
use Data::Dumper "Dumper";

my $rfc5322 = do {
    use Regexp::Grammars;    # ...the magic is lexically scoped
    qr{

    # Keep the big stick handy, just in case...
    # <debug:on>

    # Match this...
    <address>

    # As defined by these...
    <token: address>         <mailbox> | <group>
    <token: mailbox>         <name_addr> | <addr_spec>
    <token: name_addr>       <display_name>? <angle_addr>
    <token: angle_addr>      <CFWS>? \< <addr_spec> \> <CFWS>?
    <token: group>           <display_name> : (?:<mailbox_list> | <CFWS>)? ; <CFWS>?
    <token: display_name>    <phrase>
    <token: mailbox_list>    <[mailbox]> ** (,)

    <token: addr_spec>       <local_part> \@ <domain>
    <token: local_part>      <dot_atom> | <quoted_string>
    <token: domain>          <dot_atom> | <domain_literal>
    <token: domain_literal>  <CFWS>? \[ (?: <FWS>? <[dcontent]>)* <FWS>?

    <token: dcontent>        <dtext> | <quoted_pair>
    <token: dtext>           <.NO_WS_CTL> | [\x21-\x5a\x5e-\x7e]

    <token: atext>           <.ALPHA> | <.DIGIT> | [!#\$%&'*+-/=?^_`{|}~]
    <token: atom>            <.CFWS>? <.atext>+ <.CFWS>?
    <token: dot_atom>        <.CFWS>? <.dot_atom_text> <.CFWS>?
    <token: dot_atom_text>   <.atext>+ (?: \. <.atext>+)*

    <token: text>            [\x01-\x09\x0b\x0c\x0e-\x7f]
    <token: quoted_pair>     \\ <.text>

    <token: qtext>           <.NO_WS_CTL> | [\x21\x23-\x5b\x5d-\x7e]
    <token: qcontent>        <.qtext> | <.quoted_pair>
    <token: quoted_string>   <.CFWS>? <.DQUOTE> (?:<.FWS>? <.qcontent>)*
                             <.FWS>? <.DQUOTE> <.CFWS>?

    <token: word>            <.atom> | <.quoted_string>
    <token: phrase>          <.word>+

    # Folding white space
    <token: FWS>             (?: <.WSP>* <.CRLF>)? <.WSP>+
    <token: ctext>           <.NO_WS_CTL> | [\x21-\x27\x2a-\x5b\x5d-\x7e]
    <token: ccontent>        <.ctext> | <.quoted_pair> | <.comment>
    <token: comment>         \( (?: <.FWS>? <.ccontent>)* <.FWS>? \)
    <token: CFWS>            (?: <.FWS>? <.comment>)*
                             (?: (?:<.FWS>? <.comment>) | <.FWS>)

    # No whitespace control
    <token: NO_WS_CTL>       [\x01-\x08\x0b\x0c\x0e-\x1f\x7f]
    <token: ALPHA>           [A-Za-z]
    <token: DIGIT>           [0-9]
    <token: CRLF>            \x0d \x0a
    <token: DQUOTE>          "
    <token: WSP>             [\x20\x09]
    }x;
};

while (my $input = <>) {
    if ($input =~ $rfc5322) {
        say Dumper \%/;       # ...the parse tree of any successful match
                              # appears in this punctuation variable
    }
}

如您所见,通过在模式中使用略有不同的表示法,您现在得到了一些内容,该内容将整个解析树存储在%/变量中,并为所有内容都加上了整齐的标签.转换的结果仍然是一个模式,如=~运算符所见.有点神奇.

As you see, by using a very slightly different notation in the pattern, you now get something which stores the entire parse tree away for you in the %/ variable, with everything neatly labelled. The result of the transformation is still a pattern, as you can see by the =~ operator. It’s just a bit magical.

这篇关于“现代"的识别能力被称为“现代".正则表达式的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆