Parse C-Style Comments with Regex, Avoid Backtracking

Question

I want to match all block and multiline comments in a JavaScript file (these are C-Style comments). I have a pattern that works well. However, it creates some backtracking which slows it down significantly, especially on larger files.

Pattern: \/\*(?:.|[\r\n])*?\*\/|(?:\/\/.*)

Example: https://www.regex101.com/r/pR6eH6/2

How can I avoid the backtracking?

Recommended answer

You have heavy backtracking because of the alternation. Instead of the (?:.|[\r\n]), you may consider using a character class [\s\S] that boosts performance to a noticeable extent:

\/\*[\s\S]*?\*\/|\/\/.*
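As a quick sanity check that the character-class version finds the same comments as the original alternation, here is a minimal Python sketch. The sample text and the variable names are made up for illustration, and it only compares the matches, not the speed:

```python
import re

# Hypothetical sample input, not taken from the original post.
SAMPLE = "var a = 1; /* first\n   block */\nvar b = 2; // trailing note\n/* second */"

# Pattern from the question: an alternation inside the lazy loop.
ALTERNATION = re.compile(r"/\*(?:.|[\r\n])*?\*/|(?://.*)")
# Suggested replacement: one character class instead of the alternation.
CHAR_CLASS = re.compile(r"/\*[\s\S]*?\*/|//.*")

# Neither pattern has capturing groups, so findall returns whole matches.
old_matches = ALTERNATION.findall(SAMPLE)
new_matches = CHAR_CLASS.findall(SAMPLE)

print(new_matches)
```

Both patterns should return the same three comments here; the difference lies in how many steps the engine needs to get there.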

See the demo

In Python, you can use the re.S/re.DOTALL modifier to make . match line breaks, too (note that the single line comment pattern should be matched with \/\/[^\r\n]* then):

/\*.*?\*/|//[^\r\n]*
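A small Python sketch of why the single-line branch has to change once re.DOTALL is on: with that flag, //.* would run past the end of the line. The sample string and names below are illustrative assumptions, not part of the original answer:

```python
import re

# Hypothetical sample input with both \r\n and \n line breaks.
SAMPLE = "x = 1; // one\r\ny = 2; /* two\nlines */ z = 3; // three"

# With re.DOTALL, "." matches line breaks, so the block-comment branch
# can use ".*?" and the line-comment branch must exclude \r and \n itself.
DOTALL_PATTERN = re.compile(r"/\*.*?\*/|//[^\r\n]*", re.DOTALL)
dotall_matches = DOTALL_PATTERN.findall(SAMPLE)

# For contrast only: keeping "//.*" under re.DOTALL swallows the rest of the text.
OVERRUNNING = re.compile(r"/\*.*?\*/|//.*", re.DOTALL)
overrun = OVERRUNNING.findall(SAMPLE)

print(dotall_matches)
print(len(overrun))
```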

See another demo

However, since the lazy quantifier *? also causes overhead similar to that caused by greedy quantifiers, you should consider a more efficient pattern for C-style multiline comments - /\*[^*]*\*+(?:[^/*][^*]*\*+)*/ - so that the whole regex now looks like:

/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//.*
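To see how this pattern behaves on comments that contain runs of asterisks, here is a minimal Python sketch. The sample text is invented for illustration and is not from the original answer:

```python
import re

# The pattern from the answer, written as a Python raw string
# (no need to escape "/" in Python).
UNROLLED = re.compile(r"/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//.*")

# Hypothetical sample: a doc comment with inner asterisks, a line comment,
# an empty block comment, and a multiline block comment.
SAMPLE = "/** doc * with ** stars **/ f(); // tail\n/**/ g(); /* a\n * b\n */"

unrolled_matches = UNROLLED.findall(SAMPLE)

for comment in unrolled_matches:
    print(repr(comment))
```

Each part of the block-comment branch consumes a disjoint set of characters (non-asterisks, then asterisks, then one character that is neither / nor *), so the engine has very little to backtrack into.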

See another demo

Details:

  • /\* - a /*
  • [^*]* - zero or more chars other than *
  • \*+ - one or more asterisks
  • (?:[^/*][^*]*\*+)* - zero or more sequences of:
    • [^/*] - a symbol other than / and *
    • [^*]* - zero or more symbols other than *
    • \*+ - 1+ asterisks
  • / - the closing /

Note that in Python you do not need to escape / (and in JS, you do not need to escape / when declaring a regex with the RegExp constructor).

NOTE: The last pattern does not make it simple to capture what is inside /* and */, but since it is more stable than the others, I'd advise using it even when you need the contents: capture them together with the trailing * - /\*([^*]*\*+(?:[^/*][^*]*\*+)*)/|//(.*) - and then remove the last char from .group(1).
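The capture-then-trim step could look roughly like this in Python. The helper name comment_bodies and the sample text are hypothetical, chosen only for this sketch:

```python
import re

# Capturing variant from the note above: group 1 is the block-comment body
# plus one trailing "*", group 2 is the line-comment body.
CAPTURING = re.compile(r"/\*([^*]*\*+(?:[^/*][^*]*\*+)*)/|//(.*)")


def comment_bodies(source):
    """Return the text inside each comment (hypothetical helper)."""
    bodies = []
    for match in CAPTURING.finditer(source):
        if match.group(1) is not None:
            # Group 1 always ends with the "*" of the closing "*/"; drop it.
            bodies.append(match.group(1)[:-1])
        else:
            bodies.append(match.group(2))
    return bodies


# Hypothetical sample input.
SAMPLE = "/* block */ x = 1; // line\n/**/ /* a **/"
bodies = comment_bodies(SAMPLE)
print(bodies)
```

Only the final asterisk is removed, so a comment such as /* a **/ keeps its inner asterisk in the extracted body.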
