<.ident> function/capture in perl6 grammars
While reading the XML grammar for perl6 (https://github.com/supernovus/exemel/blob/master/lib/XML/Grammar.pm6), I am having some difficulty understanding the following token.
token pident {
<!before \d> [ \d+ <.ident>* || <.ident>+ ]+ % '-'
}
More specifically, it is <.ident> that I don't understand. There are no other definitions of ident, so I am assuming it is a reserved term, though I can't find a proper definition on perl6.org. Does anyone know what this means?
TL;DR I'll start with a precise and relatively concise answer. The rest of this answer is for those wanting to know more about built in rules in general and/or to drill down into ident in particular.
<.ident> function/capture
Because of the ., <.ident> only matches; it doesn't capture[1]. For the rest of this answer I'll generally omit the . because it makes no difference to a rule's meaning besides the capture aspect.
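As a minimal sketch of that difference (the variable names are made up for illustration), the two forms below match the same text, but only the form without the leading . records a named capture:

```raku
# <ident> calls the rule and stores its match under the key 'ident';
# <.ident> calls the same rule but stores nothing.
my $capturing     = 'abc' ~~ / <ident> /;
my $non-capturing = 'abc' ~~ / <.ident> /;

say ~$capturing;                      # abc
say ~$non-capturing;                  # abc
say $capturing<ident>:exists;         # True
say $non-capturing<ident>:exists;     # False
```

Inside a grammar the non-capturing form keeps the resulting match tree smaller, which is presumably why pident uses it.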
Just as you can invoke (aka "call") one function within the declaration of another in programming languages, so too you can invoke a rule/token/regex/method (hereafter I'll generally just use the term "rule") within the declaration of another rule. <foo> is the syntax used to invoke a rule named foo; so <ident> invokes a rule (method) named ident.
At the time I write this, the XML::Grammar grammar does not itself define/declare a rule named ident. That means the call ends up dispatched to a built in declaration with that name.
The built in ident rule does precisely the same as if it were declared as:
token ident {
[ <alpha> ]
[ <alnum> ]*
}
The official Predefined character classes doc should provide precise definitions of <alpha> and <alnum>. Alternatively, the relevant details are also included later on in this answer.
The bottom line is that ident matches a string of one or more "alphanumeric" characters, except that the first character cannot be a "number".
Thus both abc and def123 match, whereas 123abc does not.
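A quick way to check those three strings is to anchor the rule at both ends so that it has to account for the whole string. This is only a sketch; the hash name is made up for illustration:

```raku
# Record, for each candidate, whether <ident> matches the entire string.
my %whole-match;
for <abc def123 123abc> -> $candidate {
    %whole-match{$candidate} = so $candidate ~~ / ^ <ident> $ /;
    say "$candidate => %whole-match{$candidate}";
}
```

Without the ^ and $ anchors, 123abc would still produce a match, because the regex engine is free to skip the leading digits and match abc.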
The rest of this answer
For those interested in detail worth knowing, I've written the following sections:
- Raku (standard language and class details)
- Rakudo (high level implementation)
- NQP (mid level implementation)
- MoarVM (low level implementation)
- The specification and "specification" of ident
- (Corrections of) documentation of <ident>, "character class" and "identifier"
- ident vs Raku identifiers
Raku (standard language and class details)
XML::Grammar is a user defined Raku grammar. A Raku grammar is a class. ("Grammars are really just slightly specialized classes".)
A Raku rule is a regex is a method:
grammar foo { rule ident { ... } }
say foo.^lookup('ident').WHAT; # (Regex)
say Regex ~~ Method; # True
A rule call, like <ident>, in a grammar, is typically invoked as a result of calling .parse or similar on the grammar. The .parse call matches the input string according to the rules in the grammar.
When an occurrence of <ident> within XML::Grammar is evaluated during a match, the result is an ident method (rule) call on an instance of XML::Grammar (the .parse call creates an instance of its invocant if it's just a type object).
Because XML::Grammar does not itself define a rule/method of that name, the ident call is instead dispatched according to standard method resolution, er, rules. (I'm using the word "rules" here in the generic non-Raku specific sense. Ah, language.)
In Raku, any class created using a declaration of the form grammar foo { ... } automatically inherits from the Grammar class, which in turn inherits from the Match class:
say .^mro given grammar foo {} # ((foo) (Grammar) (Match) (Capture) (Cool) (Any) (Mu))
ident is found in the built in Match class.
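One way to observe this dispatch is to compare a grammar that relies on the inherited ident with one that declares its own. The sketch below uses two made-up grammar names; the override deliberately matches digits so the difference is easy to see:

```raku
# Falls through to the inherited, built in ident.
grammar UsesBuiltin {
    token TOP { <ident> }
}

# Declares its own ident, which method resolution finds first.
grammar OverridesIdent {
    token TOP   { <ident> }
    token ident { \d+ }
}

say so UsesBuiltin.parse('abc');      # True
say so UsesBuiltin.parse('123');      # False
say so OverridesIdent.parse('123');   # True
say so OverridesIdent.parse('abc');   # False
```

XML::Grammar is in the first situation: it declares no ident of its own, so the inherited one is used.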
Rakudo (high level implementation)
In the Rakudo compiler, the Match class does the role NQPMatchRole.
This NQPMatchRole is where the highest level implementation of ident is found.
NQP (mid level implementation)
NQPMatchRole is written in the nqp language, a subset of Raku used to bootstrap the full Raku, and the heart of NQP, a compiler toolkit.
Excerpting and reformatting just the most salient code from the ident declaration, the match for the first character boils down to:
nqp::ord($target, $!pos) == 95
|| nqp::iscclass(nqp::const::CCLASS_ALPHABETIC, $target, $!pos)
This matches if the first character is either a _ (95 is the ASCII code / Unicode codepoint for an underscore) or a character matching a character class defined in NQP called CCLASS_ALPHABETIC.
The other bit of salient code is:
nqp::findnotcclass( nqp::const::CCLASS_WORD
This matches zero or more subsequent characters in the character class CCLASS_WORD.
A search of NQP for CCLASS_ALPHABETIC shows several matches. The most useful seems to be an NQP test file. While this file makes it clear that CCLASS_WORD is a superset of CCLASS_ALPHABETIC, it doesn't make it clear what those classes actually match.
NQP targets multiple "backends" or concrete virtual machines. Given the relative paucity of Rakudo/NQP doc/tests of what these rules and character classes actually match, one has to look at one of its backends to verify what's what.
MoarVM (low level implementation)
MoarVM is the only officially supported backend.
A search of MoarVM for CCLASS shows several matches.
The important one seems to be ops.c, which includes a switch (cclass) statement which in turn includes cases for MVM_CCLASS_ALPHABETIC and MVM_CCLASS_WORD that correspond to NQP's similarly named constants.
According to the code's comments:
CCLASS_ALPHABETIC currently matches exactly the same characters as the full Raku or NQP <:L> rule, i.e. the characters Unicode has classified as "Letters".
I think that means <alpha> is equivalent to the union of CCLASS_ALPHABETIC and _ (underscore).
CCLASS_WORD matches the same plus <:Nd>, i.e. decimal digits (in any human language, not just English).
I think that means the Raku / NQP <alnum> rule is equivalent to CCLASS_WORD.
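Putting the NQP and MoarVM observations together, the built in rule can be approximated by hand with Unicode properties. The following is only a sketch under the assumption that an underscore is also accepted after the first character (consistent with your_ident matching later in this answer); the names $approx and @samples are made up, and agreement on a few samples is not proof of equivalence:

```raku
# Hand-written approximation of the built in ident rule: first a letter
# or underscore, then any number of letters, decimal digits or underscores.
my $approx = / ^ <:L + [_]> <:L + :Nd + [_]>* $ /;

my @samples = <abc _x1 é9 9a your_ident a-b>;
for @samples -> $s {
    my $builtin = so $s ~~ / ^ <ident> $ /;
    my $by-hand = so $s ~~ $approx;
    say "$s: builtin=$builtin by-hand=$by-hand";
}
```

For these samples the two columns should agree, with 9a and a-b rejected by both.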
The specification and "specification" of ident
The official specification of Raku is embodied in roast[2].
A search of roast for ident shows several matches.
Most use <ident> only incidentally, as part of testing something else. The specification requires that they work as shown, but you won't understand what <ident> is supposed to do by looking at incidental usage.
Three tests clearly test <ident> itself. One of those is essentially redundant, leaving two. I see no changes between the 6.c and 6.c.errata versions of these two matches:
From S05-mass/rx.t:
ok ('2+3 ab2' ~~ /<ident>/) && matchcheck($/, q/mob<ident>: <ab2 @ 4>/), 'capturing builtin <ident>';
ok tests that its first argument returns True. This call tests that <ident> skips 2+3 and matches ab2.
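The same behaviour can be reproduced without the roast helper matchcheck. This sketch (the variable name is made up) inspects the match object directly; the @ 4 in the roast test corresponds to the match starting at position 4:

```raku
my $match = '2+3 ab2' ~~ / <ident> /;

say ~$match;          # ab2
say $match.from;      # 4
say ~$match<ident>;   # ab2
```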
From S05-mass/charsets.t:
is $latin-chars.comb(/<ident>/).join(" "), "ABCDEFGHIJKLMNOPQRSTUVWXYZ _ abcdefghijklmnopqrstuvwxyz ª µ º ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö øùúûüýþÿ", 'ident chars';
is tests that its first argument matches its second. This call tests what the ident rule matches from a string consisting of the first 256 Unicode codepoints (the Latin-1 character set).
Here's a variation of this test that more clearly shows the matching that happens:
say ~$_ for $latin-chars ~~ m:g/<ident>/;
prints:
ABCDEFGHIJKLMNOPQRSTUVWXYZ
_
abcdefghijklmnopqrstuvwxyz
ª
µ
º
ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ
ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö
øùúûüýþÿ
But <ident> will match a whole lot more than just a hundred or so characters from Latin-1. So, while the above tests cover what <ident> is officially specified/tested to match, they clearly don't cover the full picture.
So let's look at the official speculation that may, with care, be considered related to "specification".
First, we note the warning at the top:
Note: these documents may be out of date.
For Perl 6 documentation see docs.perl6.org;
for specs, see the official test suite.
The term "specs" in this warning is short for "specification". As already explained, the official specification test suite is roast, not any human language verbiage.
(Some people still think of these historical design docs as "specifications" too, and refer to them as "specs", but the official view is that "specs", as applied to the design docs, should be considered to be short for "speculations" to emphasize that they are not something to be fully relied upon.)
A search for ident in design.raku.org shows several matches.
The most useful match is in the Predefined Subrules section of S05:
These are some of the predefined subrules for any grammar or regex:
- ident ... Match an identifier.
Uhoh...
(Corrections of) documentation of <ident>, "character class" and "identifier"
From Predefined character classes in the official doc:
Class Description
<ident> Identifier. Also a default rule.
This is misleading in three ways:
- ident is not a character class. Character classes match a single character in that character class; if used with a quantifier they just match a string of such characters, each of which can be any character from that class. In contrast, <ident> matches a particular pattern of characters. It may be one character but you can't control that; the rule is greedy, matching as many characters as fit the pattern. If you apply a quantifier it controls repetition of the overall rule, not how many characters are included in a single match of the rule.
- All built in rules are default rules. I think the default comment is there to emphasize that you can write your own ident rule if you don't like the built-in pattern. This is true for all rules, though it will typically make much less sense to override built ins such as canonical character classes like <lower> (lowercase).
- ident does not match identifiers! Or, more accurately, it doesn't do so on its own for most Raku identifiers. See the next section for the details.
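The first point can be illustrated with a short sketch (the variable names are made up). <alpha> consumes one character per call, <ident> consumes as much as fits its pattern, and a quantifier on <ident> repeats the whole rule:

```raku
my $one-char = 'ab-cd' ~~ / <alpha> /;          # one character per call
my $one-word = 'ab-cd' ~~ / <ident> /;          # as many characters as fit
my $repeated = 'ab-cd' ~~ / <ident>+ % '-' /;   # the whole rule, repeated

say ~$one-char;                            # a
say ~$one-word;                            # ab
say $repeated<ident>.map(*.Str).join(' '); # ab cd
```

The % '-' separator is used here because two <ident> calls cannot sit directly next to each other in this input: the first call would already have consumed every character that fits.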
ident vs Raku identifiers
my @Identifiers = < $bar %hash Foo Foo::Bar your_ident anothers' my-ident >;
say (~$/ if m/^<ident>$/ for @Identifiers); # (Foo your_ident)
say (~$/ if m/ <ident> / for @Identifiers); # (bar hash Foo Foo your_ident anothers my)
In nqp's grammar, which is defined in NQP's Grammar.nqp, there's:
token identifier { <.ident> [ <[\-']> <.ident> ]* }
In Raku's grammar, which is defined in Rakudo's Grammar.nqp, there's code that looks slightly different but has the exact same effect:
token apostrophe { <[ ' \- ]> }
token identifier { <.ident> [ <.apostrophe> <.ident> ]* }
So <identifier> matches a pattern that includes one or more <ident>s with <apostrophe>s in between.
The ident method is in NQPMatchRole, which means it's a built-in that's part of the rule namespace of users' grammars.
But the identifier methods are not exported by either Raku or nqp. So they are not part of the rule namespace of users' grammars.
If we write our own identifier token we can see it in action:
my token identifier { <.ident> [ <[\-']> <.ident> ]* }
my token sigil { <[$@%&]> }
say (~$/ if m/^ <sigil>? <identifier> $/ for @Identifiers)
displays:
($bar %hash Foo your_ident my-ident)
To summarize the above and some other considerations:
- <ident> matches just parts of what <identifier> matches (though they're the same for simple names). Consider is-prime. This is a Raku identifier but contains two <ident> matches (is and prime).
- <identifier> matches just parts of "Raku identifiers" (though they're the same for simple names). Consider infix:<+>. This is sometimes referred to as a Raku identifier but requires both an <identifier> match and a match of :<+>.
- Raku identifiers are themselves just parts of names (though they're the same for the simplest names). Consider Foo-Bar::Baz-Qux which contains two <identifier> matches (each in turn containing two <ident> matches).
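The nesting described in this summary can be seen by splitting the example names with each rule. This is a sketch that reuses the hand-written identifier token from earlier in this section; the array names are made up:

```raku
# Same user-level stand-in for the compiler's identifier token as above.
my token identifier { <.ident> [ <[\-']> <.ident> ]* }

my @ident-parts      = 'is-prime'.comb(/ <ident> /);
my @identifier-parts = 'Foo-Bar::Baz-Qux'.comb(/ <identifier> /);
my @innermost-parts  = 'Foo-Bar::Baz-Qux'.comb(/ <ident> /);

say @ident-parts;        # [is prime]
say @identifier-parts;   # [Foo-Bar Baz-Qux]
say @innermost-parts;    # [Foo Bar Baz Qux]
```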
Footnotes
[1] If you're not sure what a capture is, see Capturing, Named captures and Subrules.
[2] The official specification of Raku is a test suite called roast -- the Repository Of All Specification Tests. The latest version of a specific branch of roast defines a specific version of Raku. When I first wrote this answer there had only been two official branches/versions of roast, and therefore of Raku. The first was 6.c aka 6.Christmas. This was cut on Christmas day 2015 and has been deliberately left frozen since that day. The second was 6.c.errata, which conservatively added corrections to 6.c deemed sufficiently important and backwards compatible to be included in the (then) current official recommended version of Raku. An "officially compliant" Raku compiler passes some official branch of roast. The Rakudo compiler (then) passed 6.c.errata. If you read all the tests involving a feature in, say, the 6.c.errata branch of roast, then you'll have read a full definition of the officially specified meaning of that feature for the 6.c.errata version of the Raku language.