Why/how is an additional variable needed in matching repeated arbitrary character with capture groups?

Question

I'm matching a sequence of a repeating arbitrary character, with a minimum length, using a perl6 regex.

After reading through https://docs.perl6.org/language/regexes#Capture_numbers and tweaking the example given, I've come up with this code using an 'external variable':

#uses an additional variable $c
perl6 -e '$_="bbaaaaawer"; /((.){} :my $c=$0; ($c)**2..*)/ && print $0';

#Output:  aaaaa

To aid in illustrating my question only, a similar regex in perl5:

#No additional variable needed
perl -e ' $_="bbaaaaawer"; /((.)\2{2,})/ && print $1';

Could someone enlighten me on the need/benefit of 'saving' $0 into $c and the requirement of the empty {}? Is there an alternative (better/golfed) perl6 regex that will match?

Thanks in advance.

Answer

Option #1: Don't sub-capture a pattern that includes a back reference

$0 is a back reference [1].

If you omit the sub-capture around the expression containing $0, then the code works:

$_="bbaaaaawer"; / (.) $0**2..* / && print $/; # aaaaa

Then you can also omit the {}. (I'll return to why you sometimes need to insert a {} later in this answer.)


But perhaps you wrote a sub-capture around the expression containing the back reference because you thought you needed the sub-capture for some other later processing.

There are often other ways to do things. In your example, perhaps you wanted a way to be able to count the number of repeats. If so, you could instead write:

$_="bbaaaaawer";
/ (.) $0**2..* /;
print $/.chars div $0.chars; # 5

Job done, without the complications of the following sections.
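As a rough sketch, Option #1 can be wrapped in a small helper. The sub name `first-run` and the second test string are hypothetical and not part of the original answer; the regex is the one shown above.

```raku
# Hypothetical helper: returns the first run of three or more identical
# characters as a string, or Nil when the input contains no such run.
sub first-run(Str $s) {
    # (.) captures one character; $0 ** 2..* then requires at least
    # two more copies of whatever that character was.
    $s ~~ / (.) $0 ** 2..* / ?? ~$/ !! Nil;
}

say first-run("bbaaaaawer");       # aaaaa
say first-run("abcabc").defined;   # False
```

Because no extra parentheses surround the `$0`, the back reference is resolved at the same capture level as the `(.)` it refers to, so neither a variable nor `{}` is needed.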

Option #2. Sub-capture without changing the current match object during matching of the pattern that includes a back reference

Maybe you really need to sub-capture a match of an expression that includes a back reference.

This can still be done without needing to surround the $0 with a sub-capture, which avoids the problems discussed under Option #3 below.

You can use this technique if you don't need to have sub-sub-captures of the expression and the expression isn't too complicated:

$_="bbaaaaawer";
/ (.) $<capture-when-done>=$0**2..* /;
print $<capture-when-done>.join; # aaaa

This sub-captures the result of matching the expression in a named capture but avoids inserting an additional sub-capture context around the expression (which is what causes the complications discussed in the next section).

Unfortunately, while this technique will work for the expression in your question ($0**2..*), it won't if an expression is complex enough to need grouping. This is because the syntax $<foo>=[...] doesn't work. Perhaps this is fixable without hurting performance or causing other problems [2].

Option #3. Use a saved back reference inside a sub-capture

Finally we arrive at the technique you've used in your question.

Automatically available back references to sub-captures (like $0) cannot refer to sub-captures that happened outside the sub-capture they're written in. Update: see the "I'm (at least half) wrong!" note below.

So if, for any reason, you have to create a sub-capture (using either (...) or <...>) then you must manually store a back reference in a variable and use that instead.

Before we get to a final section explaining in detail why you must use a variable, let's first complete an initial answer to your question by covering the final wrinkle.

{} forces "publication" of match results thus far

The {} is necessary to force the :my $c=$0; to update each time it's reached using the current regex/grammar engine. If you don't write it, then the regex engine fails to update $c to a capture of 'a' and instead leaves it stuck on a capture of 'b'.

Please read "Publication" of match variables by Rakudo.
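For reference, here is the one-liner from the question laid out as a script with comments. This is only a restatement of that code under the behaviour described above; the variable name `$found` is an illustrative addition.

```raku
$_ = "bbaaaaawer";

# Outer ( ... ) is the sub-capture that forces the use of a variable.
# Inside it: (.) captures one character, {} publishes the match so far,
# :my $c = $0 saves that capture, and ($c) ** 2..* needs two or more copies.
my $found = / ( (.) {} :my $c = $0; ($c) ** 2..* ) / ?? ~$0 !! Nil;

say $found;   # aaaaa
```

After the match, `$0` refers to the outer sub-capture, which is why it stringifies to the whole run rather than to a single character.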

Why can't a sub-capture include a back reference to captures that happened outside that sub-capture?

First, you have to take into account that matching in P6 is optimized for the nested matching case syntactically, semantically, and implementation-wise.

In particular, if, when writing a regex or grammar, you write a numbered capture (with (...)), or a named rule/capture (with <foo>), then you've inserted a new level in a tree of sub-patterns that are dynamically matched/captured at run-time.

See jnthn's answer for why and Brad's for some discussion of details.


What I'll add to those answers is a (rough!) analogy, and another discussion of why you have to use a variable and {}.

The analogy begins with a tree of sub-directories in a file system:

/
  a
  b
    c
    d

The analogy is such that:

  • The directory structure above corresponds to the result of a completed match operation.

  • After an overall match or grammar parse is complete, the match object $/ refers (analogously speaking) to the root directory [3].

  • The sub-directories correspond to sub-captures of the match.

  • The numbered sub-matches/sub-captures $0 and $1 at the top level of the match operation shown below these bullet points correspond to sub-directories a and b. The numbered sub-captures of the top-level $1 sub-match/sub-capture correspond to the c and d sub-directories.

  • During matching $/ refers to the "current match object" which corresponds to the "current working directory".

  • It's easy to refer to a sub-capture (sub-directory) of the current match (current working directory).

  • It's impossible to refer to a sub-capture (sub-directory) outside the current match (current working directory) unless you've saved a reference to that outside directory (capture) or a parent of it. That is, P6 does not include an analog of .. or /! Update: I'm happy to report that I'm (at least half) wrong! See "What's the difference between $/ and $¢ in regex?".

If file system navigation didn't support these back references towards the root then one thing to do would be to create an environment variable that stored a particular path. That's roughly what saving a capture in a variable in a P6 regex is doing.

The central issue is that a lot of the machinery related to regexes is relative to "the current match". This includes $/, which refers to the current match, and back references like $0, which are relative to the current match. Update: see the "I'm (at least half) wrong!" note above.


Thus, in the following (runnable on tio.run), it's easy to display 'bc' or 'c' with a code block inserted in the third pair of parens...

$_="abcd";
m/ ( ( . ) ( . ( . ) { say $/ } ( . ) ) ) /; # 「bc」␤ 0 => 「c」␤
say $/;                                      # 「abcd」␤ etc.

...but it's impossible to refer to the captured 「a」 in that third pair of parens without storing 「a」's capture in a regular variable. Update: see the "I'm (at least half) wrong!" note above.
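A minimal sketch of that workaround, using the same {} plus :my technique as the question. The variable names `$first` and `$seen-inside` are hypothetical, introduced only for illustration.

```raku
my $seen-inside;   # records what the inner code block could see

$_ = "abcd";
m/ ( ( . ) {} :my $first = $0;    # publish, then save the capture of 'a'
     ( . ( . ) { $seen-inside = ~$first } ( . ) ) ) /;

say $seen-inside;   # a
```

In terms of the file system analogy, `$first` plays the role of an environment variable holding a path: the code block sits in a deeper "directory", where `$0` means 「c」, yet it can still reach 「a」 through the saved reference.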

Here's one way of looking at the above match:

  ↓ Start TOP level $/
m/ ( ( . ) ( . ( . ) { say $/ } ( . ) ) ) /; # captures 「abcd」

    ↓ Start first sub-capture; TOP's $/[0]
   (                                    )    # captures 「abcd」

      ↓ Start first sub-sub-capture; TOP's $/[0][0]
     ( . )                                   # captures 「a」

            ↓ Start *second* sub-sub-capture; TOP's $/[0][1]
           (                          )      # captures 「bcd」

                ↓ Start sub-sub-sub-capture; TOP's $/[0][1][0]
               ( . )                         # captures 「c」

                     { say $/ }              # 「bc」␤ 0 => 「c」␤

                                 ( . )       # captures 'd'

If we focus for a moment on what $/ refers to outside of the regex (and also directly inside the /.../ regex, but not inside sub-captures), then that $/ refers to the overall Match object, which ends up capturing 「abcd」. (In the filesystem analogy this particular $/ is the root directory.)

The $/ inside the code block inside the second sub-sub-capture refers to a lower level match object, specifically the one that, at the point the say $/ is executed, has already matched 「bc」 and will go on to have captured 「bcd」 by the end of the overall match.

But there's no built-in way to refer to the sub-capture of 'a', or the overall capture (which at that point would be 'abc'), from within the sub-capture surrounding the code block. Update: see the "I'm (at least half) wrong!" note above.

Hence you have to do something like what you've done.

A possible improvement?

What if there were a direct analog in P6 regexes for specifying the root? Update: see the "I'm (at least half) wrong!" note above.

Here's an initial cut at this that might make sense. Let's define a grammar:

my $*TOP;
grammar g {
  token TOP { { $*TOP := $/ } (.) {} <foo> }
  token foo { <{$*TOP[0]}> }
}
say g.parse: 'aa' # 「aa」␤ 0 => 「a」␤ foo => 「a」

So, perhaps a new variable could be introduced, one that's read-only for userland code and bound to the overall match object during a match operation. Update: see the "I'm (at least half) wrong!" note above.

But then that's not only pretty ugly (unable to use a convenient short-hand back reference like $0) but refocuses attention on the need to also insert a {}. And given that it would presumably be absurdly expensive to republish all the tree of match objects after each atom, one is brought full circle back to the current status quo. Short of the fixes mentioned in this answer, I think what is currently implemented is as good as it's likely to get.

Footnotes

[1] The current P6 doc doesn't use the conventional regex term "back reference", but $0, $1 etc. are numbered P6 back references. The simplest explanation I've seen of numbered back references is this SO about them using a different regex dialect. In P6 they start with $ instead of \ and are numbered starting from 0 rather than 1. The equivalent of \0 in other regex dialects is $/ in P6. In addition, $0 is an alias for $/[0], $1 for $/[1], etc.
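A small illustration of the numbering and aliasing described in this footnote. The input string and pattern are made up for the example.

```raku
# Illustrative input; any string with two digit groups would do.
my $ok = so "2023-10" ~~ / (\d+) '-' (\d+) /;

say ~$/;                 # 2023-10  (the whole match)
say ~$0;                 # 2023     (first group: numbering starts at 0)
say ~$1;                 # 10
say ~$0 eq ~$/[0];       # True     ($0 is shorthand for $/[0])
```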

[2] One might think this would work, but it doesn't:

$_="bbaaaaawer";
/ (.) $<doesn't-work>=[$0**2..*] /;
print $<doesn't-work>.join; # Use of Nil in string context

It seems that [...] doesn't mean "group, but don't insert a new capture level like (...) and <...> do" but instead "group, and do not capture". This renders the $<doesn't-work> in $<doesn't-work>=[$0**2..*] meaningless. Perhaps this can reasonably be fixed and perhaps it should be fixed.

[3] The current "match variable" doc says:

$/ is the match variable. It stores the result of the last Regex match and so usually contains objects of type Match.

(Fwiw $/ contains a List of Match objects if an adverb like :global or :exhaustive is used.)

The above description ignores a very important use case for $/ which is its use during matching, in which case it contains the results so far of the current regex.

Following our file system analogy, $/ is like the current working directory -- let's call it "the current working match object" aka CWMO. Outside a matching operation the CWMO ($/) is ordinarily the completed result of the last regex match or grammar parse. (I say "ordinarily" because it's writable so code can change it with as little as $/ = 42.) During matching (or actions) operations the CWMO is read-only for userland code and is bound to a Match object generated by the regex/grammar engine for the current match or action rule/method.
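The parenthetical about :global in this footnote can be checked with a short sketch. The input string is an arbitrary example.

```raku
# With the :g (:global) adverb, $/ holds a List of Match objects
# rather than a single Match.
"aXbXc" ~~ m:g/ X /;

say $/.elems;          # 2
say $/.map(*.from);    # (1 3)
```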
