正则表达式进入无限循环 [英] regular expression goes into infinite loop

查看:166
本文介绍了正则表达式进入无限循环的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在解析表格中的(种类)名称:

  Parus Ater 
H. sapiens
T. rex
Tyr。 rex

通常有两个术语(二项式)但有时有3个或更多。

  Troglodytes troglodytes troglodytes 
E. rubecula sensu stricto

我写了

  [AZ] [az] * \ 。?\s + [az] [az] +(\\\ * [az] +)* 

大部分时间都有效,但偶尔会进入无限循环。花了一些时间来追踪它是在正则表达式匹配,然后我意识到这是一个错字,我应该写了

  [AZ] [az] * \。?\s + [az] [az] +(\s + [az] +)* 

执行得当。



我的问题是:




  • 为什么会发生这种循环?

  • 有没有办法在运行程序之前检查类似的正则表达式错误?否则在分发prgram之前可能很难捕获它们并导致问题。



[注意:我不需要更多物种的一般表达式 - 物种名称有一个正式的100+行正则表达式规范 - 这只是一个初始过滤器]。



注意:问题出现了,因为虽然大多数名称被精确地提取为2或偶尔3/4项(因为它们用斜体字表示),但有一些误报(如Homo sapiens住在像伦敦这样的大城市)并且比赛在L处失败。]



注意:在调试中我发现正则表达式经常完成但速度很慢(例如在较短的目标字符串上)。通过病理案例我发现了这个错误是很有价值的。我已经学到了一个重要的教训!

解决方案

要解决问题的第一部分,你应该阅读灾难性的回溯。从本质上讲,正在发生的事情是将正则表达式与字符串匹配的方法太多,并且解析器会不断回溯以尝试使其正常工作。



In你的情况,它可能是嵌套的重复:(\s * [az] +)* 这可能导致一些非常奇怪的循环。正如Qtax娴熟地指出的那样,没有更多的信息就很难说清楚。



遗憾的是,问题的第二部分是不可能回答的。它基本上是暂停问题。由于正则表达式本质上是一个有限状态机,其输入是一个字符串,你不能创建一个通用的解决方案来预测哪些正则表达式将灾难性地回溯,哪些不会../ / p>

至于使正则表达式运行得更快的一些技巧?这是一大堆蠕虫。我花了很多时间自己研究正则表达式,有时候优化它们,这是我发现的一般帮助:


  1. 如果您的语言支持,请在循环之外编译正则表达式。

  2. 尽可能添加锚点当你知道它们有用时。特别是字符串开头的 ^ 。另请参阅: Word Boundaries

  3. 避免嵌套重复像瘟疫一样。如果你必须拥有它(你将会这样做),尽量为发动机提供提示以短路任何意外的回溯。

  4. 利用风味结构来加快速度。我偏爱非捕获组占有量词。它们不会出现在每种口味中,但是当它们出现时,你应该使用它们。另请查看原子组

  5. 我总能找到这是真的:你的正则表达式越长,你就越有效率。正则表达式是一个伟大而强大的工具,它们就像一个超级聪明的锤子。 不要陷入将一切视为钉子的陷阱。有时你正在寻找的字符串功能就在你的鼻子底下。

希望这对你有帮助。祝你好运。


I am parsing (species) names of the form:

Parus Ater
H. sapiens
T. rex
Tyr. rex

which normally have two terms (binomial) but sometimes have 3 or more.

Troglodytes troglodytes troglodytes 
E. rubecula sensu stricto

I wrote

[A-Z][a-z]*\.?\s+[a-z][a-z]+(\s*[a-z]+)*

which worked most of the time but occasionally went into an infinite loop. It took some time to track down that it was in the regex matching and then I realised it was a typo and I should have written

[A-Z][a-z]*\.?\s+[a-z][a-z]+(\s+[a-z]+)*

which performs properly.

My questions are:

  • why does this loop happen?
  • is there a way I can check for similar regex errors before running the program? Otherwise it may be difficult to trap them before the prgram is distributed and cause problems.

[Note: I don't need a more general expression for species - there is a formal 100+ line regex specification for Species names - this was just an initial filter].

NOTE: The problem arose because although most names were extracted precisely into 2 or occasionally 3/4 terms (as they were in italics) there were a few false positives (like "Homo sapiens lives in big cities like London") and the match fails at "L".]

NOTE: In debugging this I have found that the regex was often completing but being very slow (e.g. on shorter target strings). It is valuable that I found this bug through a pathological case. I have learnt an important lesson!

解决方案

To address the first part of your question, you should read up on catastrophic backtracking. Essentially, what is happening is there are too many ways to match your regular expression with your string, and the parser is continually back tracking to try and make it work.

In your case, it was probably the nested repitition: (\s*[a-z]+)* Which likely caused some very very strange loops. As Qtax has adeptly pointed out, it's hard to tell without more information.

The second part of your question is, unfortunately, impossible to answer. It's basically the Halting problem. Since Regular Expressions are essentially of a finite state machine whose input is a string, you cannot create a general solution which predicts which regular expressions will backtrack catastrophically, and which will not.

As far as some tips for making your regular expressions run faster? That's a big can of worms. I've spent a lot of time studying regular expressions on my own, and some time optimizing them, and here's what I've found generally helps:

  1. Compile your regular expressions outside of your loops, if your language supports it.
  2. Whenever possible, add anchors when you know they're useful. Especially the ^ for the beginning of the string. See also: Word Boundaries
  3. Avoid nested repetition like the plague. If you have to have it (which you will), do your best to provide hints to the engine to short circuit any unintended backtracking.
  4. Take advantage of flavor constructs to speed things up. I'm partial to Non-Capturing groups and possessive quantifiers. They don't appear in every flavor, but when they do, you should use them. Also check out Atomic Groups
  5. I always find this to be true: The longer your regular expression gets, The more trouble you're going to have making it efficient. Regular expressions are a great and powerful tool, they're like a super smart hammer. Don't fall into the trap of seeing everything as a nail. Sometimes the string function you're looking for is right under your nose.

Hope this helps you. Good luck.

这篇关于正则表达式进入无限循环的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆