Parse C-Style Comments with Regex, Avoid Backtracking
Question
I want to match all block and multiline comments in a JavaScript file (these are C-style comments). I have a pattern that works well. However, it causes backtracking that slows it down significantly, especially on larger files.
Pattern: \/\*(?:.|[\r\n])*?\*\/|(?:\/\/.*)
Example: https://www.regex101.com/r/pR6eH6/2
How can I avoid the backtracking?
Answer
You have heavy backtracking because of the alternation: each repetition of the group has to choose between two branches, which is slower than testing a single character class. Instead of (?:.|[\r\n]), consider using the character class [\s\S], which matches any character, line breaks included, and noticeably improves performance:
\/\*[\s\S]*?\*\/|\/\/.*
In Python, you can also use the re.S / re.DOTALL modifier to make . match line breaks too (note that the single-line comment pattern should then be written as \/\/[^\r\n]*, so that it still stops at the end of the line):
/\*.*?\*/|//[^\r\n]*
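As a minimal sketch of how this pattern might be used from Python: the sample string below is hypothetical and not taken from the question, and the snippet only illustrates the re.DOTALL behaviour described above.

```python
import re

# re.DOTALL (alias re.S) lets "." match line breaks, so the block-comment
# branch can span several lines. The line-comment branch uses [^\r\n]* so
# that it still stops at the end of the line.
COMMENT_RE = re.compile(r"/\*.*?\*/|//[^\r\n]*", re.DOTALL)

# Hypothetical sample input, used only for illustration.
sample = "var a = 1; // first\n/* block\n   comment */\nvar b = 2; /* inline */"

# With no capturing groups, findall returns the full text of each match.
comments = COMMENT_RE.findall(sample)
for comment in comments:
    print(repr(comment))
```

Like any purely regex-based approach, this treats every // or /* as the start of a comment, including ones that appear inside string literals.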
However, the *? lazy quantifier also causes overhead similar to that caused by greedy quantifiers, because the engine has to test for the closing */ after every character it consumes. You should consider a much more efficient pattern for C-style multiline comments, /\*[^*]*\*+(?:[^/*][^*]*\*+)*/, so that the whole regex becomes:
/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//.*
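The following sketch checks, on a small hypothetical input, that the unrolled pattern finds the same comments as the lazy [\s\S]*? version. The sample string and variable names are assumptions made for illustration. It does not measure speed, which can be compared separately with a tool such as timeit on a realistic file.

```python
import re

# Lazy version from earlier in the answer.
LAZY_RE = re.compile(r"/\*[\s\S]*?\*/|//.*")
# Unrolled version: no lazy quantifier and no alternation inside the comment body.
UNROLLED_RE = re.compile(r"/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//.*")

# Hypothetical sample covering doc-style, multi-line, and "a*b" style comments.
sample = "/** doc **/ x = a * b; // tail\n/* multi\n * line\n */ y /* a*b */"

lazy = LAZY_RE.findall(sample)
unrolled = UNROLLED_RE.findall(sample)

print(unrolled)
print(lazy == unrolled)
```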
Details:
- /\* - a /*
- [^*]* - zero or more chars other than *
- \*+ - one or more asterisks
- (?:[^/*][^*]*\*+)* - zero or more sequences of:
  - [^/*] - a char other than / and *
  - [^*]* - zero or more chars other than *
  - \*+ - one or more asterisks
- / - the closing /
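The same breakdown can be written into the pattern itself using Python's re.VERBOSE flag. This is only a sketch: the name BLOCK_COMMENT_RE and the test strings are made up for illustration.

```python
import re

# The multiline-comment pattern with each part annotated. In VERBOSE mode,
# whitespace and "#" comments outside character classes are ignored.
BLOCK_COMMENT_RE = re.compile(r"""
    /\*            # the opening /*
    [^*]*          # zero or more chars other than *
    \*+            # one or more asterisks
    (?:            # zero or more sequences of:
        [^/*]      #   a char other than / and *
        [^*]*      #   zero or more chars other than *
        \*+        #   one or more asterisks
    )*
    /              # the closing /
""", re.VERBOSE)

# A complete comment matches in full, even the empty one.
print(bool(BLOCK_COMMENT_RE.fullmatch("/**/")))
# The match cannot run past the first closing */.
print(bool(BLOCK_COMMENT_RE.fullmatch("/* a */ b */")))
# An unterminated comment does not match at all.
print(bool(BLOCK_COMMENT_RE.fullmatch("/* unterminated")))
```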
Note that in Python you do not need to escape / (and in JS, you do not need to escape / when declaring a regex with the RegExp constructor).

NOTE: The last pattern does not make it simple to capture what is inside /* and */. Since it is more stable than the others, I'd still advise using it even when you need the contents: capture them together with the trailing *, as in /\*([^*]*\*+(?:[^/*][^*]*\*+)*)/|//(.*), and then remove the last character from .group(1).
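A minimal sketch of that capturing approach in Python follows. The helper name comment_bodies and the sample string are hypothetical. Only the pattern and the "drop the last char of group 1" step come from the answer.

```python
import re

# Group 1: block-comment contents plus the "*" of the closing "*/".
# Group 2: everything after "//" in a line comment.
CAPTURE_RE = re.compile(r"/\*([^*]*\*+(?:[^/*][^*]*\*+)*)/|//(.*)")

def comment_bodies(source):
    """Return the text inside each comment, without the delimiters."""
    bodies = []
    for m in CAPTURE_RE.finditer(source):
        if m.group(1) is not None:
            # Block comment: strip the trailing "*" that belongs to "*/".
            bodies.append(m.group(1)[:-1])
        else:
            # Line comment: group 2 already excludes the leading "//".
            bodies.append(m.group(2))
    return bodies

# Hypothetical sample input, used only for illustration.
sample = "/* one */ a; // two\n/** three **/"
print(comment_bodies(sample))
```

Only one asterisk is removed, so for a comment such as /** three **/ the extra asterisks next to the delimiters stay in the result, since they are part of the text between /* and */.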