<.ident> function/capture in perl6 grammars


Question

While reading the Xml grammar for perl6 (https://github.com/supernovus/exemel/blob/master/lib/XML/Grammar.pm6), I am having some difficulties understanding the following token.

token pident {
  <!before \d> [ \d+ <.ident>* || <.ident>+ ]+ % '-'
}

More specifically <.ident>: there are no other definitions of ident, so I am assuming it is a reserved term, though I can't find a proper definition on perl6.org. Does anyone know what this means?

Answer

TL;DR I'll start with a precise and relatively concise answer. The rest of this answer is for those wanting to know more about built in rules in general and/or to drill down into ident in particular.

<.ident> function/capture

Because of the ., <.ident> only matches, it doesn't capture[1]. For the rest of this answer I'll generally omit the . because it makes no difference to a rule's meaning besides the capture aspect.
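To make the capture aspect concrete, here is a minimal sketch comparing the two call forms on the same input. The variable names are just for illustration.

```raku
# With a plain call, the sub-match is stored under the rule's name.
my $capturing = 'ab2' ~~ / <ident> /;
say $capturing<ident>.defined;      # True

# With the leading dot the rule still has to match, but nothing is stored.
my $non-capturing = 'ab2' ~~ / <.ident> /;
say $non-capturing<ident>.defined;  # False
say ~$non-capturing;                # ab2
```

Either way the overall match covers the same text; only the named sub-capture differs.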

Just as you can invoke (aka "call") one function within the declaration of another in programming languages, so too you can invoke a rule/token/regex/method (hereafter I'll generally just use the term "rule") within the declaration of another rule. <foo> is the syntax used to invoke a rule named foo; so <ident> invokes a rule (method) named ident.

At the time I write this, the XML::Grammar grammar does not itself define/declare a rule named ident. That means the call ends up dispatched to a built in declaration with that name.

The built in ident rule does precisely the same as if it were declared as:

token ident {
    [ <alpha> ]
    [ <alnum> ]*
}

The official Predefined character classes doc should provide precise definitions of <alpha> and <alnum>. Alternatively, the relevant details are also included later on in this answer.

The bottom line is that ident matches a string of one or more "alphanumeric" characters except that the first character cannot be a "number".

Thus both abc and def123 match whereas 123abc does not.
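A quick way to check this is to anchor the match at both ends, so the whole string has to be a single ident. The helper name is-ident is made up for this sketch; it is not a built in.

```raku
# Hypothetical helper: True only if the whole string is one <ident>.
sub is-ident(Str $s --> Bool) { so $s ~~ / ^ <ident> $ / }

say is-ident('abc');      # True
say is-ident('def123');   # True
say is-ident('123abc');   # False
say is-ident('_x');       # True  (a leading underscore is allowed)

# Without anchors the regex engine scans forward to the first place
# an ident can start, so a match is still found inside the string.
my $unanchored = '123abc' ~~ / <ident> /;
say ~$unanchored;         # abc
```

The last two lines are worth keeping in mind: "does not match" above means "is not an ident as a whole", not "contains no ident".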

The rest of this answer

For those interested in detail worth knowing I've written the following sections:

  • Raku (standard language and class details)

  • Rakudo (high level implementation)

  • NQP (mid level implementation)

  • MoarVM (low level implementation)

  • The specification and "specification" of ident

  • (Corrections of) documentation of <ident>, "character class" and "identifier"

  • ident vs Raku identifiers

Raku (standard language and class details)

XML::Grammar is a user defined Raku grammar. A Raku grammar is a class. ("Grammars are really just slightly specialized classes".)

A Raku rule is a regex is a method:

grammar foo { rule ident { ... } }

say foo.^lookup('ident').WHAT; # (Regex)
say Regex ~~ Method;           # True

A rule call, like <ident>, in a grammar, is typically invoked as a result of calling .parse or similar on the grammar. The .parse call matches the input string according to the rules in the grammar.

When an occurrence of <ident> within XML::Grammar is evaluated during a match, the result is an ident method (rule) call on an instance of XML::Grammar (the .parse call creates an instance of its invocant if it's just a type object).

Because XML::Grammar does not itself define a rule/method of that name, the ident call is instead dispatched according to standard method resolution, er, rules. (I'm using the word "rules" here in the generic non-Raku specific sense. Ah, language.)
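Here is a small sketch of that dispatch in action. The two grammar names are made up for illustration: one relies on the built in ident, the other declares its own, which then wins under ordinary method resolution.

```raku
# No ident declared here, so <ident> resolves to the built in one.
grammar UsesBuiltin {
    token TOP { <ident> }
}

# A local ident shadows the built in for this grammar only.
grammar OverridesIdent {
    token TOP   { <ident> }
    token ident { \d+ }
}

say so UsesBuiltin.parse('abc');     # True
say so UsesBuiltin.parse('123');     # False
say so OverridesIdent.parse('123');  # True
say so OverridesIdent.parse('abc');  # False
```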

In Raku, any class created using a declaration of the form grammar foo { ... } automatically inherits from the Grammar class which in turn inherits from the Match class:

say .^mro given grammar foo {} # ((foo) (Grammar) (Match) (Capture) (Cool) (Any) (Mu))

ident is found in the built in Match class.

Rakudo (high level implementation)

In the Rakudo compiler, the Match class does the role NQPMatchRole.

This NQPMatchRole is where the highest level implementation of ident is found.

NQP (mid level implementation)

NQPMatchRole is written in the nqp language, a subset of Raku used to bootstrap the full Raku, and the heart of NQP, a compiler toolkit.

Excerpting and reformatting just the most salient code from the ident declaration, the match for the first character boils down to:

   nqp::ord($target, $!pos) == 95
|| nqp::iscclass(nqp::const::CCLASS_ALPHABETIC, $target, $!pos)

This matches if the first character is either a _ (95 is the ASCII code / Unicode codepoint for an underscore) or a character matching a character class defined in NQP called CCLASS_ALPHABETIC.

The other bit of salient code is:

nqp::findnotcclass( nqp::const::CCLASS_WORD

This matches zero or more subsequent characters in the character class CCLASS_WORD.

A search of NQP for CCLASS_ALPHABETIC shows several matches. The most useful seems to be an NQP test file. While this file makes it clear that CCLASS_WORD is a superset of CCLASS_ALPHABETIC, it doesn't make it clear what those classes actually match.

NQP targets multiple "backends" or concrete virtual machines. Given the relative paucity of Rakudo/NQP doc/tests of what these rules and character classes actually match, one has to look at one of its backends to verify what's what.

MoarVM (low level implementation)

MoarVM is the only officially supported backend.

A search of MoarVM for CCLASS shows several matches.

The important one seems to be ops.c which includes a switch (cclass) statement which in turn includes cases for MVM_CCLASS_ALPHABETIC and MVM_CCLASS_WORD that correspond to NQP's similarly named constants.

According to the code's comments:

CCLASS_ALPHABETIC currently matches exactly the same characters as the full Raku or NQP <:L> rule, i.e. the characters Unicode has classified as "Letters".

I think that means <alpha> is equivalent to the union of CCLASS_ALPHABETIC and _ (underscore).

CCLASS_WORD matches the same plus <:Nd>, i.e. decimal digits (in any human language, not just English).

I think that means the Raku / NQP <alnum> rule is equivalent to CCLASS_WORD.
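If those readings are right, the built in could be approximated in plain Raku as below. This is a sketch under those assumptions, not the actual implementation: ident-sketch is a made-up name, and I am assuming the underscore is accepted in later positions too (which fits your_ident matching as a whole further down).

```raku
# Assumed equivalent: first a letter or underscore,
# then any run of letters, decimal digits or underscores.
my token ident-sketch { <:L + [_]> <:L + :Nd + [_]>* }

# Compare the sketch with the built in on a few sample strings.
my @samples = <abc def123 123abc _x a_b x-y>;
for @samples -> $s {
    my $builtin = so $s ~~ / ^ <ident> $ /;
    my $sketch  = so $s ~~ / ^ <ident-sketch> $ /;
    say $s, ': built in ', $builtin, ', sketch ', $sketch;
}
```

Agreement on a handful of strings does not prove equivalence, of course; it only shows the sketch is consistent with the description above.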

The specification and "specification" of ident

The official specification of Raku is embodied in roast[2].

A search of roast for ident shows several matches.

Most use <ident> only incidentally, as part of testing something else. The specification requires that they work as shown, but you won't understand what <ident> is supposed to do by looking at incidental usage.

Three tests clearly test <ident> itself. One of those is essentially redundant, leaving two. I see no changes between the 6.c and 6.c.errata versions of these two matches:

From S05-mass/rx.t:

ok ('2+3 ab2' ~~ /<ident>/) && matchcheck($/, q/mob<ident>: <ab2 @ 4>/), 'capturing builtin <ident>';

ok tests that its first argument returns True. This call tests that <ident> skips 2+3 and matches ab2.
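The same check can be reproduced without the test helpers. This sketch just inspects the capture directly; the "ab2 @ 4" in the test corresponds to the matched text and its start position.

```raku
my $match = '2+3 ab2' ~~ / <ident> /;

say ~$match<ident>;      # ab2
say $match<ident>.from;  # 4
```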

From S05-mass/charsets.t:

is $latin-chars.comb(/<ident>/).join(" "), "ABCDEFGHIJKLMNOPQRSTUVWXYZ _ abcdefghijklmnopqrstuvwxyz ª µ º ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö øùúûüýþÿ", 'ident chars';

is tests that its first argument matches its second. This call tests what the ident rule matches from a string consisting of the first 256 Unicode codepoints (the Latin-1 character set).

Here's a variation of this test that more clearly shows the matching that happens:

say ~$_ for $latin-chars ~~ m:g/<ident>/;

prints:

ABCDEFGHIJKLMNOPQRSTUVWXYZ
_
abcdefghijklmnopqrstuvwxyz
ª
µ
º
ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ
ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö
øùúûüýþÿ

But <ident> will match a whole lot more than just a hundred or so characters from Latin-1. So, while the above tests cover what <ident> is officially specified/tested to match, they clearly don't cover the full picture.
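The snippets above don't show how $latin-chars is built. A minimal sketch, assuming it is simply the 256 codepoints joined in order, reproduces the nine matches and also shows a string outside Latin-1 matching:

```raku
# Assumption: the test string is codepoints 0 through 255 joined together.
my $latin-chars = (0 .. 0xFF).map(*.chr).join;

my @idents = $latin-chars.comb(/ <ident> /);
say @idents.elems;   # 9
say @idents[0];      # ABCDEFGHIJKLMNOPQRSTUVWXYZ
say @idents[1];      # _

# Letters outside Latin-1 match as well, for example Cyrillic.
my $cyrillic = so 'привет' ~~ / ^ <ident> $ /;
say $cyrillic;       # True
```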

So let's look at the official speculation that may, with care, be considered related to "specification".

First, we note the warning at the top:

Note: these documents may be out of date.
For Perl 6 documentation see docs.perl6.org;
for specs, see the official test suite.

The term "specs" in this warning is short for "specification". As already explained, the official specification test suite is roast, not any human language verbiage.

(Some people still think of these historical design docs as "specifications" too, and refer to them as "specs", but the official view is that "specs", as applied to the design docs, should be considered to be short for "speculations" to emphasize that they are not something to be fully relied upon.)

A search for ident in design.raku.org shows several matches.

The most useful match is in the Predefined Subrules section of S05:

These are some of the predefined subrules for any grammar or regex:

  • ident ... Match an identifier.

Uhoh...

(Corrections of) documentation of <ident>, "character class" and "identifier"

From Predefined character classes in the official doc:

    Class                             Description
    <ident>                           Identifier. Also a default rule.

This is misleading in three ways:

  • ident is not a character class. Character classes match a single character in that character class; if used with a quantifier they just match a string of such characters, each of which can be any character from that class. In contrast <ident> matches a particular pattern of characters. It may be one character but you can't control that; the rule is greedy, matching as many characters as fit the pattern. If you apply a quantifier it controls repetition of the overall rule, not how many characters are included in a single match of the rule.

  • All built in rules are default rules. I think the default comment is there to emphasize that you can write your own ident rule if you don't like the built-in pattern. This is true for all rules, though it will typically make much less sense to override built-ins such as canonical character classes like <lower> (lowercase).

  • ident does not match identifiers! Or, more accurately, it doesn't do so on its own for most Raku identifiers. See the next section for the details.
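The first point is easy to see by putting the same quantifier on a real character class and on ident. A small sketch (the variable names are only for illustration):

```raku
# <alpha> is a character class: + repeats it once per character.
'abc' ~~ / <alpha>+ /;
my $alpha-count = $<alpha>.elems;
say $alpha-count;    # 3

# <ident> is a pattern: one call already swallows the whole word,
# so + finds only a single repetition.
'abc' ~~ / <ident>+ /;
my $ident-count = $<ident>.elems;
my $ident-text  = ~$<ident>[0];
say $ident-count;    # 1
say $ident-text;     # abc
```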

ident vs Raku identifiers

my @Identifiers = < $bar %hash Foo Foo::Bar your_ident anothers' my-ident >; 
say (~$/ if m/^<ident>$/ for @Identifiers); # (Foo your_ident)
say (~$/ if m/ <ident> / for @Identifiers); # (bar hash Foo Foo your_ident anothers my)

In nqp's grammar, which is defined in NQP's Grammar.nqp, there's:

token identifier { <.ident> [ <[\-']> <.ident> ]* }

In Raku's grammar, which is defined in Rakudo's Grammar.nqp, there's code that looks slightly different but has the exact same effect:

token apostrophe { <[ ' \- ]> }
token identifier { <.ident> [ <.apostrophe> <.ident> ]* }

So <identifier> matches a pattern that includes one or more <ident>s with <apostrophe>s in between.

The ident method is in NQPMatchRole, which means it's a built-in that's part of the rule namespace of users' grammars.

But the identifier methods are not exported by either Raku or nqp. So they are not part of the rule namespace of users' grammars.

If we write our own identifier token we can see it in action:

my token identifier { <.ident> [ <[\-']> <.ident> ]* }
my token sigil { <[$@%&]> }
say (~$/ if m/^ <sigil>? <identifier> $/ for @Identifiers)

displays:

($bar %hash Foo your_ident my-ident)

To summarize the above and some other considerations:

  • <ident> matches just parts of what <identifier> matches (though they're the same for simple names). Consider is-prime. This is a Raku identifier but contains two <ident> matches (is and prime).

  • <identifier> matches just parts of "Raku identifiers" (though they're the same for simple names). Consider infix:<+>. This is sometimes referred to as a Raku identifier but requires both an <identifier> match and a match of :<+>.

  • Raku identifiers are themselves just parts of names (though they're the same for the simplest names). Consider Foo-Bar::Baz-Qux which contains two <identifier> matches (each in turn containing two <ident> matches).
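The nesting in these three points can be made visible with a capturing variant of the token shown earlier. This is a sketch: capturing-identifier is a made-up name, and it differs from the real grammars only in using <ident> rather than <.ident> so that the parts are kept.

```raku
# Same shape as the identifier token above, but keeping the <ident> parts.
my token capturing-identifier { <ident> [ <[\-']> <ident> ]* }

# One identifier, two ident parts.
my $m = 'is-prime' ~~ / ^ <capturing-identifier> $ /;
my @ident-parts = $m<capturing-identifier><ident>.map(*.Str);
say @ident-parts;       # [is prime]

# One name, two identifier parts.
my @identifier-parts = 'Foo-Bar::Baz-Qux'.comb(/ <capturing-identifier> /);
say @identifier-parts;  # [Foo-Bar Baz-Qux]
```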

Footnotes

[1] If you're not sure what a capture is, see Capturing, Named captures and Subrules.

[2] The official specification of Raku is a test suite called roast -- the Repository Of All Specification Tests. The latest version of a specific branch of roast defines a specific version of Raku. When I first wrote this answer there had only been two official branches/versions of roast, and therefore of Raku. The first was 6.c aka 6.Christmas. This was cut on Christmas day 2015 and has been deliberately left frozen since that day. The second was 6.c.errata, which conservatively added corrections to 6.c deemed sufficiently important and backwards compatible to be included in the (then) current official recommended version of Raku. An "officially compliant" Raku compiler passes some official branch of roast. The Rakudo compiler (then) passed 6.c.errata. If you read all the tests involving a feature in, say, the 6.c.errata branch of roast, then you'll have read a full definition of the officially specified meaning of that feature for the 6.c.errata version of the Raku language.
