为什么 Perl 回溯匹配失败似乎比匹配成功花费的时间更少? [英] Why does Perl backtracking match failure seem to take less time than match success?

查看:60
本文介绍了为什么 Perl 回溯匹配失败似乎比匹配成功花费的时间更少?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个巨大的文件aab.txt,其内容是aaa...aab.

I have a huge file aab.txt whose contents are aaa...aab.

令我大吃一惊

perl -ne '/a*bb/' < aab.txt

运行(匹配失败)比

perl -ne '/a*b/' < aab.txt

(匹配成功).为什么????两者都应该首先吞噬所有的 a,然后第二个立即成功,而第一个则必须一遍又一遍地回溯,失败.

(match success). Why???? Both should first gobble up all the a's, then the second one immediately succeeds, while the first will then have to backtrack over and over again, to fail.

推荐答案

Perl 正则表达式被优化为宁愿尽早失败,而不是尽快成功.这在通过大型日志文件 grep 时很有意义.

Perl regexes are optimized to rather fail as early as possible, than to succeed as fast as possible. This makes a lot of sense when grepping through a large log file.

有一个优化,首先寻找字符串的一个常量部分,在这种情况下,一个浮动"的 bbb.这可以相当有效地检查,而不必跟踪回溯状态.没有找到 bb,匹配就在那里中止.

There is an optimization that first looks for a constant part of the string, in this case, a "floating" b or bb. This can be checked rather efficiently without having to keep track of backtracking state. No bb is found, and the match aborts right there.

b 并非如此.找到该浮动子字符串,并从那里构建匹配.这是正则表达式匹配的调试输出(程序是 "aaab" =~/a*b/):

Not so with b. That floating substring is found, and the match constructed from there. Here is the debug output of the regex match (program is "aaab" =~ /a*b/):

Compiling REx "a*b"
synthetic stclass "ANYOF_SYNTHETIC[ab][]".
Final program:
   1: STAR (4)
   2:   EXACT <a> (0)
   4: EXACT <b> (6)
   6: END (0)
floating "b" at 0..2147483647 (checking floating) stclass ANYOF_SYNTHETIC[ab][] minlen 1 
Guessing start of match in sv for REx "a*b" against "aaab"
Found floating substr "b" at offset 3...
start_shift: 0 check_at: 3 s: 0 endpos: 4 checked_upto: 0
Does not contradict STCLASS...
Guessed: match at offset 0
Matching REx "a*b" against "aaab"
Matching stclass ANYOF_SYNTHETIC[ab][] against "aaab" (4 bytes)
   0 <> <aaab>               |  1:STAR(4)
                                  EXACT <a> can match 3 times out of 2147483647...
   3 <aaa> <b>               |  4:  EXACT <b>(6)
   4 <aaab> <>               |  6:  END(0)
Match successful!
Freeing REx: "a*b"

您可以使用 re pragma 的 debug 选项获得这样的输出.

You can get such output with the debug option for the re pragma.

严格来说,查找 bbb 是不必要的,但它可以让匹配更早地失败.

Finding the b or bb is unnecessary, strictly speaking, but it allows the match to fail much earlier.

这篇关于为什么 Perl 回溯匹配失败似乎比匹配成功花费的时间更少?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆