perl6 grammar, not sure about some syntax in an example

Problem description

I am still learning perl6, and I am reading the grammar example on this page: http://examples.perl6.org/categories/parsers/SimpleStrings.html ; I have read the documentation on regexes multiple times, but there is still some syntax that I don't understand.

token string { <quote> {} <quotebody($<quote>)> $<quote> }

Question 1: what is this "{}" in the token doing? The capture marker is <()>, and the nesting-structure construct is the tilde, '(' ~ ')'; but what is {}?

token quotebody($quote) { ( <escaped($quote)> | <!before $quote> . )* }

Question 2a: escaped($quote) inside <> would be a regex function, right? And it takes $quote as an argument and returns another regex ?

Question 2b: If I want to indicate "char that is not before quote", should I use ". <!before $quote>" instead of "<!before $quote> ." ??

token escaped($quote) { '\\' ( $quote | '\\' ) } # I think this is a function;

Recommended answer

TL;DR @briandfoy has provided an easy to digest answer. But here be dragons that he didn't mention. And pretty butterflies too. This answer goes deep.

Question 1: what is this {} in the token doing?

It's a code block.1,2,3,4

It's an empty one and has been inserted purely to force the $<quote> in quotebody($<quote>) to evaluate to the value captured by the <quote> at the start of the regex.

The reason why $<quote> does not contain the right value without insertion of a code block is a Rakudo Perl 6 compiler limitation or bug related to "publication of match variables".

Moritz Lenz states in a Rakudo bug report that "the regex engine doesn't publish match variables unless it is deemed necessary".

By "regex engine" he means the regex/grammar engine in NQP, part of the Rakudo Perl 6 compiler.3

By "match variables", he means the variables that store captures of match results:

  • the current match variable $/;

  • numbered sub-match variables $0, $1, etc.;

  • named sub-match variables of the form $<foo>.

By "publish" he means that the regex/grammar engine does what it takes so that any mentions of any variables in a regex (a token is also a regex) evaluate to the values they're supposed to have when they're supposed to have them. Within a given regex, match variables are supposed to contain a Match object corresponding to what has been captured for them at any given stage in processing of that regex, or Nil if nothing has been captured.

By "deemed necessary" he means that the regex/grammar engine makes a conservative call about whether it's worth doing the publication work after each step in the matching process. By "conservative" I mean that the engine often avoids doing publication, because it slows things down and is usually unnecessary. Unfortunately it's sometimes too optimistic about when publication is actually necessary. Hence the need for programmers to sometimes intervene by explicitly inserting a code block to force publication of match variables (and other techniques for other variables5). It's possible that the regex/grammar engine will improve in this regard over time, reducing the scenarios in which manual intervention is necessary. If you wish to help progress this, please create test cases that matter to you for existing related bugs.5

The named capture $<quote> is the case in point here.

As far as I can tell, all sub-match variables correctly refer to their captured value when written directly into the regex without a surrounding construct. This works:

my regex quote { <['"]> }
say so '"aa"' ~~ / <quote> aa $<quote> /; # True

I think6 $<quote> gets the right value because it is parsed as a regex slang construct.4

In contrast, if the {} were removed from

token string { <quote> {} <quotebody($<quote>)> $<quote> }

then the $<quote> in quotebody($<quote>) would not contain the value captured by the opening <quote>.

I think this is because the $<quote> in this case is parsed as a main slang construct.
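A minimal sketch of the pattern under discussion, cut down so it can be run on its own. The grammar name QuotedString and the token name body are made up for illustration (they are not from the SimpleStrings example), and escape handling is left out. The {} is kept in place; whether a given Rakudo release still needs it is not something this sketch settles.

```raku
# Hypothetical cut-down grammar; names are illustrative only.
grammar QuotedString {
    # The empty code block sits between the capture and its use as an argument.
    token TOP      { <quote> {} <body($<quote>)> $<quote> }
    token quote    { <['"]> }
    # Consume characters until the next one is the opening quote.
    token body($q) { [ <!before $q> . ]* }
}

say so QuotedString.parse(q{"abc"});   # matching open and close quotes
say so QuotedString.parse(q{'abc"});   # mismatched open and close quotes
```

Removing the {} and rerunning is a quick way to see how the Rakudo you have installed behaves.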

Question 2a: escaped($quote) inside <> would be a regex function, right? And it takes $quote as an argument

That's a good first approximation.

More specifically, regex atoms of the form <foo(...)> are calls of the method foo.

All regexes -- whether declared with token, regex, rule, /.../ or any other form -- are methods. But methods declared with method are not regexes:

say Method ~~ Regex; # False
say WHAT token { . } # (Regex)
say Regex ~~ Method; # True
say / . / ~~ Method; # True

When the <escaped($quote)> regex atom is encountered, the regex/grammar engine doesn't know or care if escaped is a regex or not, nor about the details of method dispatch inside a regex or grammar. It just invokes method dispatch, with the invocant set to the Match object that's being constructed by the enclosing regex.

The call yields control to whatever ends up running the method. It typically turns out that the regex/grammar engine is just recursively calling back into itself because typically it's a matter of one regex calling another. But it isn't necessarily so.

and returns another regex

No, a regex atom of the form <escaped($quote)> does not return another regex.

Instead it calls a method that will/should return a Match object.

If the method called is a regex, P6 makes sure the regex generates and populates the Match object automatically.

If the method called is not a regex but just an ordinary method, then the method's code should manually create and return a Match object. Moritz shows an example in his answer to the SO question "Can I change the Perl 6 slang inside a method?".

The Match object is returned to the "regex/grammar engine" that drives regex matching / grammar parsing.3

The engine then decides what to do next according to the result:

  • If the match was successful, the engine updates the overall match object corresponding to the calling regex. The updating may include saving the returned Match object as a sub-match capture of the calling regex. This is how a match/parse tree gets built.

  • If the match was unsuccessful, the engine may backtrack, undoing previous updates; thus the parse tree may dynamically grow and shrink as matching progresses.
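As a small illustration of sub-match captures forming a tree, here is a sketch with a made-up grammar (KV, key and value are hypothetical names, not part of the example under discussion). Each <key> and <value> call returns a Match object that is saved under that name in the Match for TOP.

```raku
# Hypothetical grammar used only to show how sub-matches nest.
grammar KV {
    token TOP   { <key> '=' <value> }
    token key   { \w+ }
    token value { \d+ }
}

my $m = KV.parse('answer=42');
say $m<key>.Str;               # answer
say $m<value>.Str;             # 42

# With no digits after '=', <value> fails, so the whole parse fails.
say so KV.parse('answer=');    # False
```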

Question 2b: If I want to indicate "char that is not before quote", should I use . <!before $quote> instead of <!before $quote> . ??

Yes.

But that's not what's needed for the quotebody regex, if that's what you're talking about.
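The difference between the two orderings can be seen in a short sketch. The variable names and input string here are made up for illustration; the patterns are the two forms from the question, wrapped in a group and repeated.

```raku
my $quote = '"';
my $input = 'ab"';

# "Not at a quote? Then take one character": stops just before the quote.
say ~($input ~~ / [ <!before $quote> . ]+ /);   # ab

# "Take one character, provided the NEXT one is not a quote":
# the 'b' is rejected because a quote follows it.
say ~($input ~~ / [ . <!before $quote> ]+ /);   # a
```

The first form is the one quotebody needs, since the body should include every character up to the closing quote.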

While on the latter topic, in @briandfoy's answer he suggests using a "Match ... anything that's not a quote" construct rather than doing a negative look ahead (<!before $quote>). His point is that matching "not a quote" is much easier to understand than "are we not before a quote? then match any character".

However, it is by no means straightforward to do this when the quote is a variable whose value is set to the capture of the opening quote. This complexity is due to bugs in Rakudo. I've worked out what I think is the simplest way around them, but I think it is likely best to just stick with <!before $quote> . unless/until these long-standing Rakudo bugs are fixed.5

token escaped($quote) { '\\' ( $quote | '\\' ) } # I think this is a function;

It's a token, which is a Regex, which is a Method, which is a Routine:

say token { . } ~~ Regex;   # True
say Regex       ~~ Method;  # True
say Method      ~~ Routine; # True

The code inside the body (the { ... } bit) of a regex (in this instance the code is the lone . in token { . }, which is a regex atom that matches a single character) is written in the P6 regex "slang" whereas the code used inside the body of a method routine is written in the main P6 "slang".4

The regex tilde (~) operator is specifically designed for the sort of parsing in the example this question is about. It reads better inasmuch as it's instantly recognizable and keeps the opening and closing quotes together. Much more importantly it can provide a human intelligible error message in the event of failure because it can say what closing delimiter(s) it's looking for.
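A minimal sketch of the tilde operator's shape: opener ~ closer inner. The first line uses literal parentheses; the second reuses the quote regex shown earlier with $<quote> as the closer, and is given without an expected result because it depends on the match-variable behaviour discussed above.

```raku
my regex quote { <['"]> }

# Reads as: '(' then \w+ then ')', with the delimiters written side by side.
say so '(abc)' ~~ / '(' ~ ')' \w+ /;          # True

# Same shape, with the closing delimiter taken from the opening capture.
say so '"aa"' ~~ / <quote> ~ $<quote> aa /;
```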

But there's a key wrinkle you must consider if you insert a code block in a regex (with or without code in it) right next to the regex ~ operator (on either side of it). You will need to group the code block unless you specifically want the tilde to treat the code block as its own atom. For example:

token foo { <quote> ~ $<quote> {} <quotebody($<quote>)> }

will match a pair of <quote>s with nothing between them. (And then try to match <quotebody...>.)

In contrast, here's a way to duplicate the matching behavior of the string token in the String::Simple::Grammar grammar:

token string { <quote> ~ $<quote> [ {} <quotebody($<quote>)> ] }

Footnotes

1 In 2002 Larry Wall wrote "It needs to be just as easy for a regex to call Perl code as it is for Perl code to call a regex.". Computer scientists note that you can't have procedural code in the middle of a traditional regular expression. But Perls long ago led the shift to non-traditional regexes and P6 has arrived at the logical conclusion -- a simple {...} is all it takes to insert arbitrary procedural code in the middle of a regex. The language design and regex/grammar engine implementation3 ensure that traditional style purely declarative regions within a regex are recognized, so that formal regular expression theory and optimizations can be applied to them, but nevertheless arbitrary regular procedural code can also be inserted. Simple uses include matching logic and debugging. But the sky's the limit.

2 The first procedural element of a regex, if any, terminates what's called the "declarative prefix" of the regex. A common reason for inserting an empty code block ({}) is to deliberately terminate a regex's declarative prefix when that provides the desired matching semantics for a given longest alternation in a regex. (But that isn't the reason for its inclusion in the token you're trying to understand.)
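A sketch of that declarative-prefix effect, using made-up literals. As I read the longest-token rules, the first alternation should pick 'abc' because it has the longer declarative prefix, while in the second the {} cuts the second branch's prefix down to 'a', so 'ab' should win instead. Treat that as my expectation rather than a guarantee.

```raku
# Both branches are fully declarative; the longer one is preferred.
say ~('abc' ~~ / 'ab' | 'abc' /);

# The {} ends the second branch's declarative prefix after 'a'.
say ~('abc' ~~ / 'ab' | 'a' {} 'bc' /);
```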

3 Loosely speaking, the regex / grammar engine in NQP is to P6 what PCRE is to P5.

A key difference is that the regex language, along with its associated regex/grammar engine, and the main language it cooperates with, which in the case of Rakudo is Perl 6, are co-equals control-wise. This is an implementation of Larry Wall's original 2002 vision for integration between regexes and "rich languages". Each language/run-time can call into the other and communicate via high level FFIs. So they can appear to be, can behave as, and indeed are, a single system of cooperating languages and cooperating run-times.

(The P6 design is such that all languages can be explicitly designed, or be retro-fitted, to cooperate in a "rich" manner via two complementary P6 FFIs: the metamodel FFI 6model and/or the C calling convention FFI NativeCall.)

4 The P6 language is actually a collection of sub-languages -- aka slangs -- that are used together. When you are reading or writing P6 code you are reading or writing source code that starts out in one slang but has sections written in others. The first line in a file uses the main slang. Let's say that's analogous to English. Regexes are written in another slang; let's say that's like Spanish. So in the case of the grammar String::Simple::Grammar, the code begins in English (the use v6; statement), then recurses into Spanish (after the { of rule TOP {), i.e. the ^ <string> $ bit, and then returns back out into English (the comment starting # Note ...). Then it recurses back into Spanish for <quote> {} <quotebody($<quote>)> $<quote> and in the middle of that Spanish, at the {} codeblock, it recurses into another level of English again. So that's English within Spanish within English. Of course, the code block is empty, so it's like writing/reading nothing in English and then immediately dropping back into Spanish, but it's important to understand that this recursive stacking of languages/run-times is how P6 works, both as a single overall language/run-time and when cooperating with other non-P6 languages/run-times.

5 I encountered several bugs, listed at the end of this footnote, in the process of applying two potential improvements. (Both mentioned in briandfoy's answer and this one.) The two "improvements" are use of the ~ construct, and a "not a quote" construct instead of using <!before foo> .. The final result, plus mention of pertinent bugs:

grammar String::Simple::Grammar {
  rule TOP {^ <string> $}
  token string {
    :my $*not-quote;
    <quote> ~ $<quote>
    [
      { $*not-quote = "<-[$<quote>]>" }
      <quotebody($<quote>)>
    ]
  }
  token quote { '"' | "'" }
  token quotebody($quote) { ( <escaped($quote)> | <$*not-quote> )* }
  token escaped($quote) { '\\' ( $quote | '\\' ) }
}

If anyone knows of a simpler way to do this, I'd love to hear about it in a comment below.

I ended up searching the RT bugs database for all regex bugs. I know SO isn't a bug database, but I think it's reasonable for me to note the following ones. As I understand it, the first two directly interact with the issue of publication of match variables.

  • "the < > regex call syntax looks up lexicals only in the parent scope of the regex it is used in, and not in the scope of the regex itself." rt #127872

  • A backtracking problem involving arguments passed in regex calls.

It looks like there are lots of nasty threading bugs. Most boil down to the fact that several regex features use EVAL behind the scenes and EVAL is not yet thread-safe. Fortunately the official doc mentions these.

  • Can't do recursive grammars due to .parse setting $/.

6 This question and my answer have pushed me to the outer limits of my understanding of an ambitious and complex aspect of P6. I plan to soon gain greater insight into the precise interactions between nqp and full P6, and the hand-offs between their regex slangs and main slangs, as discussed in footnotes above. (My hopes currently largely rest on having just bought commaide.) I'll update this answer if/when I have some results.
