Why do some regex engines match .* twice in a single input string?

Question

Many regex engines match .* twice in a single-line string, e.g., when performing regex-based string replacement:

  • The 1st match is - by definition - the entire (single-line) string, as expected.
  • In many engines there is a 2nd match, namely the empty string; that is, even though the 1st match has consumed the entire input string, .* is matched again, which then matches the empty string at the end of the input string.

  • Note: To ensure that only one match is found, use ^.*
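
As a minimal sketch of the two matches described above, here is the same check in Python, using only the standard `re` module (the variable names are illustrative). `re.finditer` reports both the whole-string match and the empty match at the end of the input, and the start anchor removes the second one:

```python
import re

text = "a"

# Collect the (start, end) span of every match of .* in the input.
spans = [m.span() for m in re.finditer(r".*", text)]
print(spans)  # [(0, 1), (1, 1)]: the whole string, then the empty string at the end

# With the start anchor, only the first match remains.
anchored_spans = [m.span() for m in re.finditer(r"^.*", text)]
print(anchored_spans)  # [(0, 1)]
```

The second span, `(1, 1)`, starts and ends at the end of the input, which is the zero-length match the question is about.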

My questions are:

  • Is there a good reason for this behavior? Once the input string has been consumed in full, I wouldn't expect another attempt to find a match.

  • Other than by trial and error, can you tell from the documentation, or from the regex dialect/standard an engine supports, which engines exhibit this behavior?

Update: revo's helpful answer explains the how of the current behavior; as for the potential why, see this related question.

Languages/platforms that DO exhibit the behavior:

 # .NET, via PowerShell (behavior also applies to the -replace operator)
 PS> [regex]::Replace('a', '.*', '[$&]')
 [a][]  # !! Note the *2* matches, first the whole string, then the empty string

 # Node.js
 $ node -pe "'a'.replace(/.*/g, '[$&]')"
 [a][]

 # Ruby
 $ ruby -e "puts 'a'.gsub(/.*/, '[\\0]')"
 [a][]

 # Python 3.7+ only
 $ python -c "import re; print(re.sub('.*', '[\g<0>]', 'a'))"
 [a][] 

 # Perl 5
 $ echo a | perl -ple 's/.*/[$&]/g'
 [a][] 

 # Perl 6
 $ echo 'a' | perl6 -pe 's:g/.*/[$/]/'
 [a][]

 # Others?

Languages/platforms that do NOT exhibit the behavior:

# Python 2.x and Python 3.x <= 3.6
$ python -c "import re; print(re.sub('.*', '[\g<0>]', 'a'))"
[a]  # !! Only 1 match found.

# Others?


bobble bubble brings up some good related points:

If you make it lazy like .*?, you'd even get 3 matches in some and 2 matches in others. Same with .??. As soon as we use a start anchor, I thought we should get only one match, but interestingly it seems ^.*? gives two matches in PCRE for a, whereas ^.* should result in one match everywhere.
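
To see the lazy variants in one engine, the following small sketch uses Python's standard `re` module. It assumes Python 3.7 or later, where a non-empty match may start right after a previous empty match; on such versions both patterns should report three matches for the input `a` (empty, `a`, empty). Older Python versions and other engines may report a different count, which is exactly the inconsistency described above.

```python
import re

text = "a"

# Lazy star: prefers the empty match first, then is forced to advance.
lazy_star = re.findall(r".*?", text)

# Lazy optional: same idea, but it can consume at most one character.
lazy_optional = re.findall(r".??", text)

print(lazy_star)
print(lazy_optional)
```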


Here is a PowerShell snippet for testing the behavior across languages, with multiple regexes:

Note: Assumes that Python 3.x is available as python3 and Perl 6 as perl6.
You can paste the whole snippet directly on the command line and recall it from the history to modify the inputs.

& {
  param($inputStr, $regexes)

  # Define the commands as script blocks.
  # IMPORTANT: Make sure that $inputStr and $regex are referenced *inside "..."*
  #            Always use "..." as the outer quoting, to work around PS quirks.
  $cmds = { [regex]::Replace("$inputStr", "$regex", '[$&]') },
          { node -pe "'$inputStr'.replace(/$regex/g, '[$&]')" },
          { ruby -e "puts '$inputStr'.gsub(/$regex/, '[\\0]')" },
          { python -c "import re; print(re.sub('$regex', '[\g<0>]', '$inputStr'))" },
          { python3 -c "import re; print(re.sub('$regex', '[\g<0>]', '$inputStr'))" },
          { "$inputStr" | perl -ple "s/$regex/[$&]/g" },
          { "$inputStr" | perl6 -pe "s:g/$regex/[$/]/" }

  $regexes | foreach {
    $regex = $_
    Write-Verbose -vb "----------- '$regex'"
    $cmds | foreach { 
      $cmd = $_.ToString().Trim()
      Write-Verbose -vb ('{0,-10}: {1}' -f (($cmd -split '\|')[-1].Trim() -split '[ :]')[0], 
                                           $cmd -replace '\$inputStr\b', $inputStr -replace '\$regex\b', $regex)
      & $_ $regex
    }
  }

} -inputStr 'a' -regexes '.*', '^.*', '.*$', '^.*$', '.*?'

Sample output for regex ^.*, which confirms bobble bubble's expectation that using the start anchor (^) yields only one match in all languages:

VERBOSE: ----------- '^.*'
VERBOSE: [regex]   : [regex]::Replace("a", "^.*", '[$&]')
[a]
VERBOSE: node      : node -pe "'a'.replace(/^.*/g, '[$&]')"
[a]
VERBOSE: ruby      : ruby -e "puts 'a'.gsub(/^.*/, '[\\0]')"
[a]
VERBOSE: python    : python -c "import re; print(re.sub('^.*', '[\g<0>]', 'a'))"
[a]
VERBOSE: python3   : python3 -c "import re; print(re.sub('^.*', '[\g<0>]', 'a'))"
[a]
VERBOSE: perl      : "a" | perl -ple "s/^.*/[$&]/g"
[a]
VERBOSE: perl6     : "a" | perl6 -pe "s:g/^.*/[$/]/"
[a]

Recommended answer

Kinda interesting question. Instead of addressing your questions first, I'll start with your comment.

Once the input string has been consumed in full, why would you treat the fact that there is nothing left as the empty string?

What is left is a position called the end of the subject string. It is a position, and it can be matched. Like zero-width assertions and anchors such as \b, \B, ^ and $, a dot-star .* can match an empty string. This is highly dependent on the regex engine; e.g., TRegEx handles it differently.

And if you do, shouldn't this result in an infinite loop?

No, handling this is one of the main jobs of a regex engine. Engines raise a flag and store the current cursor data to prevent such loops from occurring. The Perl docs explain it this way:

A common abuse of this power stems from the ability to make infinite loops using regular expressions, with something as innocuous as:

'foo' =~ m{ ( o? )* }x;

The o? matches at the beginning of foo, and since the position in the string is not moved by the match, o? would match again and again because of the * quantifier. Another common way to create a similar cycle is with the looping modifier /g...

Thus Perl allows such constructs, by forcefully breaking the infinite loop. The rules for this are different for lower-level loops given by the greedy quantifiers *+{} , and for higher-level ones like the /g modifier or split() operator.

The lower-level loops are interrupted (that is, the loop is broken) when Perl detects that a repeated expression matched a zero-length substring.
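
The quoted passage is about Perl, but the same kind of guard can be observed in other engines. As a hedged illustration (Python's standard `re` module, not Perl), the sketch below runs the pattern from the quote against `foo`. A naive engine could repeat the empty `o?` iteration forever at position 0; here the call simply returns with an empty match.

```python
import re

# (o?)* can match the empty string inside the * loop. At position 0 of "foo"
# there is no "o", so every iteration of the group would be zero-length.
m = re.match(r"(o?)*", "foo")
print(m.span())  # (0, 0): an empty match, and the call terminates

# When the group can consume characters, the loop behaves normally.
full = re.fullmatch(r"(o?)*", "oo")
print(full.span())  # (0, 2)
```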

Now, back to your questions:

Is there a good reason for this behavior?

Yes, there is. Every regex engine has to meet a significant number of challenges in order to process text, one of which is dealing with zero-length matches. Your question raises another question:

Q: How should an engine proceed after matching a zero-length string?

A: It all depends.

It matches it, then raises a flag so that the same position is not matched again by the (same) pattern. In PCRE, .* matches the entire subject string and stops right after it. The current position, the end of the subject, is still a meaningful position in PCRE: positions can be matched or asserted, so there is one position (a zero-length string) left to match. PCRE goes through the regex again (if the g modifier is enabled) and finds a match at the end of the subject.

PCRE then tries to advance to the next position to run the whole process again, but it fails since there is no position left.
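
The process just described can be sketched as a simple "global match" loop. This is a simplified illustration written in Python, not PCRE's actual implementation: the function name `global_match_spans` is hypothetical, and after an empty match it merely steps forward by one character, whereas real engines such as PCRE use more refined rules (which is one reason engines disagree on lazy patterns).

```python
import re

def global_match_spans(pattern, text):
    """Yield (start, end) for successive matches, stepping past empty ones."""
    regex = re.compile(pattern)
    pos = 0
    # pos == len(text) is still a valid position: the end of the subject string.
    while pos <= len(text):
        m = regex.search(text, pos)
        if m is None:
            break
        yield m.span()
        # After a non-empty match, resume where it ended. After an empty
        # match, move one character forward so the same position is not
        # matched again; this is what prevents an infinite loop.
        pos = m.end() if m.end() > m.start() else m.end() + 1

unanchored = list(global_match_spans(r".*", "a"))
anchored = list(global_match_spans(r"^.*", "a"))
print(unanchored)  # [(0, 1), (1, 1)]
print(anchored)    # [(0, 1)]
```

The loop condition `pos <= len(text)` is the key detail: the end of the subject is itself a position, so the pattern gets one more attempt there, and .* succeeds with a zero-length match.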

So, if you want to prevent the second match from happening, you need to tell the engine in some way:

^.*

Or to provide a better insight into what is going on:

(?!$).*
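
As a quick check of both patterns, here is a small Python sketch (standard `re` module; the helper name `bracket_matches` is made up for this example). It wraps every match in brackets, like the `[$&]` replacements in the question. It also shows one difference between the two patterns: on an empty input, ^.* still matches once, while (?!$).* does not match at all.

```python
import re

def bracket_matches(pattern, text):
    # Wrap every match in [...], mirroring the '[$&]' replacement idea.
    return re.sub(pattern, lambda m: "[" + m.group(0) + "]", text)

anchored = bracket_matches(r"^.*", "a")
not_at_end = bracket_matches(r"(?!$).*", "a")
print(anchored)    # [a]
print(not_at_end)  # [a]

# On an empty input the two patterns differ.
anchored_empty = bracket_matches(r"^.*", "")
not_at_end_empty = bracket_matches(r"(?!$).*", "")
print(repr(anchored_empty))    # '[]'
print(repr(not_at_end_empty))  # ''
```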

See the live demo here; specifically, look at …
