RegEx调试 [英] RegEx debugging

查看:141
本文介绍了RegEx调试的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在使用两个我可以访问的调试工具通过字符串AAAC(来自rexegg.com的示例)调试正则表达式^(A+)*B.

I'm debugging a Regular Expression ^(A+)*B over a string AAAC (example from rexegg.com) by two separate debugging tools I have access:

  1. regex101.com
  2. RegexBuddy v4

下面是结果(左侧为regex101):

Below is the results (regex101 in the left side):

我的问题主要不在于步骤数,这对我来说也很重要,而是回溯如何完成.为什么我们看到差异? (regex101使用PCRE lib,我将RegexBuddy lib设置为相同)

The question I have is not mainly about the number of steps which is also important to me, but is how backtracks are done. Why do we see differences? (regex101 uses PCRE lib and I set RegexBuddy lib the same)

我完全赞成逐步进行全面的解释.

A comprehensive step by step explanation is really in my favor.

推荐答案

我不会依赖步骤数或任何调试作为测试
回溯或失败的原因.

I wouldn't rely on either the number of steps or any debugging as a test
of either backtracking or failure.

通常,远离诸如(A+)*之类的简单构造,而不仅仅是
回溯内部A+,但也回溯外部(..)*.
每一次 outer 都会产生新的(新的) inner 系列回溯.

Generally, keep away from simple constructs such as (A+)* that not only
backtrack the inner A+ but backtrack the outter (..)* as well.
Each pass of the outter produces a fresh (new) inner series of backtracking.

获取优质的基准软件,例如 RegexFormat
并测试一系列指数时间结果.
线性结果是您要寻找的内容,因为内容增加了相同的
数量.

Get a good benchmark software like RegexFormat
and test a series for an exponential time result.
A linear result is what you are looking for as the content increases by the same
amount.

以下是(A+)?(A+)*的示例.我们将目标设置为已知故障
并稳定地增加长度以扩展对该故障的处理.

Below is an example of (A+)? vs (A+)*. We set the target to a known failure
and steadily increase the length to extend the processing of that failure.

如果查看正则表达式2 (A+)*,您会发现just
的指数增长. 三个目标增加.
最后,我们爆破了目标,使发动机的内部资源超负荷.

If you look at regex 2 (A+)* you can notice the exponential increase in just
three target increases.
Finally, we blow out the target to overload the internal resources of the engine.

编辑 :经过分析,我对回溯步骤发表了微不足道的结论.
通过下面的基准时间分析,回溯似乎是一个 2 ^ N 过程.
其中 N 是失败时回溯的字符文字的数量.

Edit: After some analysis, I post a meager conclusion on backtracking steps.
By analysis of the benchmark time's below, backtracking appears to be a 2^N process.
Where N is the number of character literals that are backtracked on failure.

考虑到Revo的简单情况,隔离回溯要容易一些.
因为看起来好像占总时间的98%只是回溯.

Given Revo's simple case, it's a bit easier to isolate the backtracking.
Because it looks like %98 of the total time taken involves just backtracking.

鉴于此假设,一个人可以从2个点中获取时间结果,并生成一个方程.

Given that assumption, one can take the time results from 2 points, and generate an equation.

我使用的方程式的 form f(x) = a * r^x.
给定坐标('A's vs.时间),​​使用点数
(7,1.13),(9,4.43)生成了此f(x) = 0.009475 * 1.9799 ^ x
真的是这个# sec's = 0.009475 * 1.9799 ^ # letters
我将以下基准中所有的字母"A"输入该公式.

The form of the equation I used was f(x) = a * r^x.
Given coordinates (# of 'A's vs. Time taken), using points
(7, 1.13) , (9, 4.43) which generated this f(x) = 0.009475 * 1.9799 ^ x
which is really this # sec's = 0.009475 * 1.9799 ^ # letters
I ran all the number of letter 'A's from the benchmark's below into this formula.

#LTR's   Bench Time
 7         1.13
 9         4.43
13        70.51

产生了准确的基准时间(+/- .1%).

which produced the exact benchmark time ( +/- .1% ).

然后我看到1.9799非常接近2.0,因此我将'a'因子调整为.009并再次运行,得到了这个:

I then saw that 1.9799 is pretty close to 2.0, so I adjusted the 'a' factor down to .009 and ran it again, getting this:

f(7 letters) = 2 ^ 7 * .009 =  1.152 sec’s
f(9 letters) = 2 ^ 9 * .009 =  4.608 sec’s
f(13 letters) = 2 ^ 13 * .009 =  73.728 sec’s
f(27 letters) = 2 ^ 27 * .009 =  1207959.552 sec’s = 335 hours  

现在似乎很清楚,回溯步骤与
的数量相关 以 2 ^ N 关系回溯的字母.

It seems pretty clear now that backtracking steps correlate to the number of
letters to backtrack in a 2 ^ N relationship.

我还认为,某些引擎可以完成这种简单的2 ^ N数学运算
只有在这种简单的情况下.这些似乎是
引擎立即返回并显示太复杂了!.
这里的翻译是,正则表达式非常简单,引擎可以
检测到它.有时候引擎永远不会回来,
它已被挂起,可能在一两年(或十年)后回来.

I also think it's a fair bet that some engines can do this simple 2^N math
only in simple scenario's like this one. These seem to be the times where
the engine comes back immediately and says Too complex!.
The translation here is that the regex is simple enough that the engine can
detect it. Other times, the engine never comes back,
it's hung, and may come back in a year or two (or ten..).

因此,结论不是引擎是否会回溯,而是如何
您的目标字符串会影响回溯吗?

Conclusion therefore is not if the engine will backtrack, it will, but how
will your target string affect the backtracking.

因此,编写正则表达式时必须保持警惕,并且必须警惕
嵌套的开放式量词.

So, vigilance is required when writing regex, and must be on guard against
nested open ended quantifiers.

我想说,最好的选择总是打个正则表达式,使之失败.
我说的是在可疑的地方有过多的重复文字.
结束编辑

I'd say the best bet is always to hit a regex real hard to get it to fail.
And I'm talking about excessive repetitive literals in suspect places.
end edit

基准应用程序

目标:AAAAAAAC

Target: AAAAAAAC

基准

Regex1:   ^(A+)?B
Options:  < none >
Completed iterations:   50  /  50     ( x 1000 )
Matches found per iteration:   0
Elapsed Time:    0.07 s,   72.04 ms,   72040 µs


Regex2:   ^(A+)*B
Options:  < none >
Completed iterations:   50  /  50     ( x 1000 )
Matches found per iteration:   0
Elapsed Time:    1.13 s,   1126.44 ms,   1126444 µs

目标:AAAAAAAAAC

Target: AAAAAAAAAC

基准

Regex1:   ^(A+)?B
Options:  < none >
Completed iterations:   50  /  50     ( x 1000 )
Matches found per iteration:   0
Elapsed Time:    0.08 s,   82.43 ms,   82426 µs


Regex2:   ^(A+)*B
Options:  < none >
Completed iterations:   50  /  50     ( x 1000 )
Matches found per iteration:   0
Elapsed Time:    4.43 s,   4433.19 ms,   4433188 µs

目标:AAAAAAAAAAAAAC

Target: AAAAAAAAAAAAAC

基准

Regex1:   ^(A+)?B
Options:  < none >
Completed iterations:   50  /  50     ( x 1000 )
Matches found per iteration:   0
Elapsed Time:    0.10 s,   104.02 ms,   104023 µs


Regex2:   ^(A+)*B
Options:  < none >
Completed iterations:   50  /  50     ( x 1000 )
Matches found per iteration:   0
Elapsed Time:    70.51 s,   70509.32 ms,   70509321 µs

目标:AAAAAAAAAAAAAAAAAAAAAAAAAAAC

Target: AAAAAAAAAAAAAAAAAAAAAAAAAAAC

基准

Regex1:   ^(A+)?B
Options:  < none >
Completed iterations:   50  /  50     ( x 1000 )
Matches found per iteration:   0
Elapsed Time:    0.18 s,   184.05 ms,   184047 µs


Regex2:   ^(A+)*B
Options:  < none >
Error:  Regex Runtime Error !!   
Completed iterations:   0  /  50     ( x 1000 )
Matches found per iteration:   -1
Elapsed Time:    0.01 s,   7.38 ms,   7384 µs


Error item 2 :  Target Operation .. 

The complexity of matching the regular expression exceeded predefined
bounds. Try refactoring the regular expression to make each choice made by
the state machine unambiguous. This exception is thrown to prevent
"eternal" matches that take an indefinite period time to locate. 

这篇关于RegEx调试的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆