Improving/Fixing a Regex for C-style block comments

Question

I'm writing (in C#) a simple parser to process a scripting language that looks a lot like classic C.

On one script file I have, the regular expression that I'm using to recognize /* block comments */ is going into some kind of infinite loop, taking 100% CPU for ages.

The Regex I'm using is this:

/\*([^*]|[\r\n]|(\*+([^*/]|[\r\n])))*\*+/

Any suggestions on why this might get locked up?

Alternatively, what's another Regex I could use instead?

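More info:

  • Working in C# 3.0 targeting .NET 3.5;
  • I'm using the Regex.Match(string,int) method to start matching at a particular index of the string;
  • I've left the program running for over an hour, but the match hasn't completed;
  • The options passed to the Regex constructor are RegexOptions.Multiline and RegexOptions.IgnorePatternWhitespace;
  • The regex works correctly for 452 of my 453 test files.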

Recommended answer

Some problems I see with your regex:

There's no need for the |[\r\n] sequences in your regex; a negated character class like [^*] matches everything except *, including line separators. It's only the . (dot) metacharacter that doesn't match those.
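The difference between a negated class and the dot can be checked directly. The following is a minimal sketch, written as top-level statements (which assumes C# 9 or later); the sample string and variable names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical comment body containing a Windows line break.
string body = "line one\r\nline two";

// A negated class matches any character except '*', line breaks included.
bool negatedClassSpansLines = Regex.IsMatch(body, @"\A[^*]+\z");

// Without RegexOptions.Singleline, '.' does not match '\n',
// so it cannot cover the whole two-line string.
bool dotSpansLines = Regex.IsMatch(body, @"\A.+\z");

Console.WriteLine(negatedClassSpansLines); // True
Console.WriteLine(dotSpansLines);          // False
```

Because [^*] already accepts the line breaks, the extra [\r\n] alternative only gives the engine a second way to match the same character.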

Once you're inside the comment, the only character you have to look for is an asterisk; as long as you don't see one of those, you can gobble up as many characters as you want. That means it makes no sense to use [^*] when you can use [^*]+ instead. In fact, you might as well put that in an atomic group -- (?>[^*]+) -- because you'll never have any reason to give up any of those not-asterisks once you've matched them.
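To see what "never give up" means in practice, here is a small sketch comparing a plain [^*]+ with its atomic form. The test strings are invented for the demonstration and the code assumes C# 9 top-level statements.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "aab";

// Plain quantifier: [^*]+ first grabs "aab", then backtracks one
// character so the trailing 'b' in the pattern can match.
bool greedyMatches = Regex.IsMatch(input, @"\A[^*]+b\z");

// Atomic group: once (?>[^*]+) has taken "aab" it gives nothing back,
// so the trailing 'b' has nothing left to match.
bool atomicMatches = Regex.IsMatch(input, @"\A(?>[^*]+)b\z");

// In the comment regex the group is followed by an asterisk, which
// [^*] can never consume, so making it atomic loses no matches.
bool atomicBeforeStar = Regex.IsMatch("aab*", @"\A(?>[^*]+)\*\z");

Console.WriteLine(greedyMatches);    // True
Console.WriteLine(atomicMatches);    // False
Console.WriteLine(atomicBeforeStar); // True
```

The atomic group only removes backtracking positions that could never lead to a match here, which is why it is safe in this pattern.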

Filtering out extraneous junk, the final alternative inside your outermost parens is \*+[^*/], which means "one or more asterisks, followed by a character that isn't an asterisk or a slash". That will always match the asterisk at the end of the comment, and it will always have to give it up again because the next character is a slash. In fact, if there are twenty asterisks leading up to the final slash, that part of your regex will match them all, then it will give them all up, one by one. Then the final part -- \*+/ -- will match them for keeps.
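The wasted work on the closing run of asterisks can be illustrated by testing the two sub-patterns on their own. This is only a sketch of the behavior described above, with a made-up four-asterisk tail and C# 9 top-level statements assumed; it shows the outcome of the attempts, not the step count.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical end of a comment: four asterisks and the closing slash.
string closingRun = "****/";

// \*+ takes all four asterisks and finds '/', then retries with
// three, two and one asterisk; each retry is followed by '*',
// so the whole alternative fails after trying every length.
bool branchMatches = Regex.IsMatch(closingRun, @"\A\*+[^*/]");

// The closing part of the pattern accepts the same run on the first try.
bool closerMatches = Regex.IsMatch(closingRun, @"\A\*+/");

Console.WriteLine(branchMatches); // False
Console.WriteLine(closerMatches); // True
```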

For maximum performance, I would use this regex:

/\*(?>(?:(?>[^*]+)|\*(?!/))*)\*/

This will match a well-formed comment very quickly, but more importantly, if it starts to match something that isn't a valid comment, it will fail as quickly as possible.
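A possible way to use this pattern with Regex.Match(string,int), as the question does, is sketched below. The leading \G is an addition that is not part of the answer's regex: it pins the match to the start index so a failed attempt does not scan forward for a later comment. The script text and variable names are hypothetical, and C# 9 top-level statements are assumed.

```csharp
using System;
using System.Text.RegularExpressions;

// The answer's pattern, prefixed with \G (an assumption of this sketch)
// so that the match must begin exactly at the requested start index.
var blockComment = new Regex(@"\G/\*(?>(?:(?>[^*]+)|\*(?!/))*)\*/");

// Made-up script: one well-formed comment and one that is never closed.
string script = "x = 1; /* set x\n * to one **/ y = 2; /* never closed";

int firstStart = script.IndexOf("/*", StringComparison.Ordinal);
int lastStart = script.LastIndexOf("/*", StringComparison.Ordinal);

// Well-formed comment: matched in a single left-to-right pass.
Match closed = blockComment.Match(script, firstStart);

// Unterminated comment: the atomic group consumes the rest of the input,
// the closing \*/ fails, and there is nothing to backtrack into.
Match unclosed = blockComment.Match(script, lastStart);

Console.WriteLine(closed.Success);   // True
Console.WriteLine(closed.Index);     // 7
Console.WriteLine(unclosed.Success); // False
```

Without \G, Regex.Match(string,int) only sets where the search begins, so whether to anchor depends on how the surrounding parser tracks its position.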

Courtesy of David, here's a version that matches nested comments with any level of nesting:

(?s)/\*(?>/\*(?<LEVEL>)|\*/(?<-LEVEL>)|(?!/\*|\*/).)+(?(LEVEL)(?!))\*/
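The nested version can be tried out as follows. This is a sketch with invented inputs, assuming C# 9 top-level statements; the pattern is the one given above with its backslash escapes written out.

```csharp
using System;
using System.Text.RegularExpressions;

// (?<LEVEL>) pushes a capture for each inner "/*", (?<-LEVEL>) pops one
// for each inner "*/", and (?(LEVEL)(?!)) rejects the match if any
// inner comment is still open when the final "*/" is reached.
var nestedComment = new Regex(
    @"(?s)/\*(?>/\*(?<LEVEL>)|\*/(?<-LEVEL>)|(?!/\*|\*/).)+(?(LEVEL)(?!))\*/");

// Balanced nesting: the whole outer comment is matched.
Match balanced = nestedComment.Match("/* outer /* inner */ still outer */ code");

// Outer comment never closed: no match can start at index 0, so the
// unanchored search moves on and finds the inner comment instead.
Match unbalanced = nestedComment.Match("/* a /* b */");

Console.WriteLine(balanced.Value);   // /* outer /* inner */ still outer */
Console.WriteLine(unbalanced.Value); // /* b */
Console.WriteLine(unbalanced.Index); // 5
```

If the inner-comment fallback in the second case is unwanted, the pattern would need to be anchored to the position where the parser expects a comment.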

It uses .NET's Balancing Groups, so it won't work in any other flavor. For the sake of completeness, here's another version (from RegexBuddy's Library) that uses the Recursive Groups syntax supported by Perl, PCRE and Oniguruma/Onigmo:

/\*(?>[^*/]+|\*[^/]|/[^*])*(?>(?R)(?>[^*/]+|\*[^/]|/[^*])*)*\*/
