Raku regex: Inconsistent longest token matching

Question

Raku regexes are expected to match the longest token.

Indeed, that behaviour can be seen in this code:

raku -e "'AA' ~~ m/A {say 1}|AA {say 2}/"
# 2

However, when the text is in a variable, it does not seem to work in the same way:

raku -e "my $a = 'A'; my $b = 'AA'; 'AA' ~~ m/$a {say 1}|$b {say 2}/"
# 1

Why do the two cases behave differently? Is there a way to use variables and still match the longest token?

Solution

There are two things at work here.

The first is the meaning of "longest token". When there is an alternation (using | or implied by use of proto regexes), the declarative prefix of each branch is extracted. Declarative means the subset of the Raku regex language that can be matched by a finite state machine. The declarative prefix is determined by taking regex elements until a non-declarative element is encountered. You can read more and find some further references in the docs.
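As a minimal sketch of where a declarative prefix ends (it mirrors an example from the Raku regex documentation; the input string 'abc' is only illustrative), an empty code block `{}` can be used to cut a branch's prefix short:

```raku
# Both branches are fully declarative, so the branch with the longest
# match ('a.*' matching 'abc') wins, even though it is written second.
say ~('abc' ~~ / ab | a.* /);      # abc

# The empty block {} is non-declarative, so the second branch's declarative
# prefix is just 'a'. The first branch's prefix 'ab' is now longer and wins.
say ~('abc' ~~ / ab | a {} .* /);  # ab
```

The same mechanism explains the question's result: anything non-declarative at the start of a branch leaves that branch with nothing to contribute to the longest-token comparison.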

To understand why things are this way, a small detour may be helpful. A common approach to building parsers is to write a tokenizer, which breaks the input text up into a sequence of "tokens", and then a parser that identifies larger (and perhaps recursive) structure from those tokens. Tokenizing is typically performed using a finite state machine, since it is able to rapidly cut down the search space. With Raku grammars, we don't write the tokenizer ourselves; instead, it's automatically extracted from the grammar for us (more precisely, a tokenizer is calculated per alternation point).
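To illustrate the automatically derived tokenizer, here is a small sketch using a proto token. The grammar name `Ops` and its tokens are hypothetical and exist only for this example.

```raku
# Hypothetical grammar, for illustration only.
grammar Ops {
    token TOP { <op> }
    proto token op {*}
    token op:sym<+>  { <sym> }
    token op:sym<++> { <sym> }
}

# The proto's candidates are ordered by longest token rather than by
# declaration order, so '++' is tried before '+'.
say ~Ops.parse('++')<op>;   # ++
```

No hand-written tokenizer decides between `+` and `++` here; the choice falls out of the declarative prefixes of the two candidates.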

Secondly, Raku regexes are a nested language within the main Raku language, parsed in a single pass with it and compiled at the same time. (This is a departure from most languages, where regexes are provided as a library that we pass strings to.) The longest token calculation takes place at compile time, whereas variables are interpolated at runtime. A variable interpolation in a regex is therefore non-declarative, so it is not considered part of longest token matching. In the example from the question, both branches begin with a variable, which leaves each with an empty declarative prefix; the longest-token comparison has nothing to distinguish them, and the first branch is tried first.
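On the second part of the question, one option that may help is to put the candidate strings into an array and interpolate the array instead of separate scalars. According to the Raku regex documentation, an interpolated array is handled like a `|` alternation of its elements and the longest match succeeds. The following is a sketch under that assumption, not necessarily what the answer's author would recommend; the variable name `@tokens` is made up here.

```raku
my $a = 'A';
my $b = 'AA';

# Interpolating an array behaves like a '|' alternation over its elements,
# and the longest matching element is chosen at runtime.
my @tokens = $a, $b;
say ~('AA' ~~ / @tokens /);   # AA
```

This form does not attach a separate code block to each branch. To find out which candidate matched, compare the matched text against `$a` and `$b` after the match.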
