Python regex: finding C-style comments


Question

I'm trying to find C-style comments in a C file, but I'm having trouble when // happens to appear inside quotation marks. This is the file:

/*My function
is great.*/
int j = 0//hello world
void foo(){
    //tricky example
    cout << "This // is // not a comment\n";
}

My pattern also matches inside that cout line. This is what I have so far (I can already match the /**/ comments):

import re

# s is the path to the C source file
with open(s) as fp:
    source = fp.read()

p = re.compile(r'//(.+)')
txt = p.findall(source)
print(txt)

Recommended answer

The first step is to identify cases where // or /* must not be interpreted as the beginning of a comment, for example when they are inside a string (between quotes). To skip content between quotes (or other things), the trick is to put it in a capture group and to insert a backreference in the replacement pattern:

Pattern:

(
    "(?:[^"\\]|\\[\s\S])*"
  |
    '(?:[^'\\]|\\[\s\S])*'
)
|
//.*
|
/\*(?:[^*]|\*(?!/))*\*/

Replacement:

\1
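As a rough sketch of how this pattern and replacement could be used from Python: the names PATTERN and strip_comments below are illustrative, not part of the original answer. A replacement function is used instead of the bare \1 string because, before Python 3.5, re.sub raised an error when the referenced group did not take part in the match (which is the case for every comment match here).

```python
import re

# Verbose form of the pattern above: group 1 captures quoted literals,
# the two comment branches capture nothing.
PATTERN = re.compile(r'''
    (
        "(?:[^"\\]|\\[\s\S])*"      # double-quoted string
      |
        '(?:[^'\\]|\\[\s\S])*'      # single-quoted literal
    )
  |
    //.*                            # single-line comment
  |
    /\*(?:[^*]|\*(?!/))*\*/         # multi-line comment
''', re.VERBOSE)


def strip_comments(source):
    """Return source with comments removed and quoted literals left intact."""
    # Equivalent to the replacement \1: a quoted literal is put back
    # unchanged, a comment (group 1 is None) is replaced by nothing.
    return PATTERN.sub(lambda m: m.group(1) or '', source)


print(strip_comments('cout << "a // b"; // real comment'))
```

Note that this removes the comment text only; the whitespace around it and the line break after a // comment stay where they were.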


Since quoted parts are searched for first, each time you find // or /*...*/ you can be sure that you are not inside a string.

Note that the pattern is deliberately inefficient (because of the (A|B)* subpatterns) to make it easier to understand. To make it more efficient, you can rewrite it like this:

("(?=((?:[^"\\]+|\\[\s\S])*))\2"|'(?=((?:[^'\\]+|\\[\s\S])*))\3')|//.*|/\*(?=((?:[^*]+|\*(?!/))*))\4\*/

(?=(something+))\1 is just a way to emulate an atomic group (?>something+).
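A small illustration of the emulated atomic group may help; the two pattern names are made up for this example. The lookahead captures as much as it can, and the backreference then consumes exactly that text, so the engine cannot give characters back the way an ordinary greedy group can.

```python
import re

# Ordinary greedy group: a+ can give back one "a" so the final "a" matches.
backtracking = re.compile(r'(?:a+)a')

# Emulated atomic group: the lookahead captures every "a" in group 1 and
# \1 consumes exactly that text, leaving nothing for the final "a".
atomic_like = re.compile(r'(?=(a+))\1a')

print(backtracking.match('aaa') is not None)
print(atomic_like.match('aaa') is not None)
```

Python 3.11 and later also accept the (?>...) syntax directly, so the emulation is mainly needed on older versions.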


So, if you only want to find comments (and not remove them), the handiest approach is to put the comment part of the pattern in a capture group and test whether it is non-empty. The following pattern has been updated (after Jonathan Leffler's comment) to handle the trigraph ??/, which the preprocessor interprets as a backslash character (I assume that the code isn't written for the -trigraphs option), and to handle a backslash followed by a newline, which allows a single logical line to be split across several physical lines:

import re

# s is the path to the C source file
with open(s) as fp:
    source = fp.read()

p = re.compile(r'''(?x)
(?=["'/])      # trick to make it faster, a kind of anchor
(?:
    "(?=((?:[^"\\?]+|\?(?!\?/)|(?:\?\?/|\\)[\s\S])*))\1" # double quotes string
  |
    '(?=((?:[^'\\?]+|\?(?!\?/)|(?:\?\?/|\\)[\s\S])*))\2' # single quotes string
  |
    (
        /(?:(?:\?\?/|\\)\n)*/(?:.*(?:\?\?/|\\)\n)*.* # single line comment
      |
        /(?:(?:\?\?/|\\)\n)*\*                       # multiline comment
        (?=((?:[^*]+|\*+(?!(?:(?:\?\?/|\\)\n)*/))*))\4
        \*(?:(?:\?\?/|\\)\n)*/
    )
)
''')

# findall returns one tuple of the four groups per match;
# index 2 is group 3, which is set only when a comment was matched
for m in p.findall(source):
    if m[2]:
        print(m[2])
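For readers who do not need trigraph or line-continuation handling, the same idea can be sketched with the simpler pattern from the start of the answer, applied to the file from the question. The names FIND_COMMENTS and find_comments are hypothetical, and this is a simplified variant rather than the pattern above.

```python
import re

# Simplified variant: quoted literals are matched but not captured,
# comments end up in group 1. No trigraph or line-continuation handling.
FIND_COMMENTS = re.compile(r'''
    "(?:[^"\\]|\\[\s\S])*"          # double-quoted string, skipped
  |
    '(?:[^'\\]|\\[\s\S])*'          # single-quoted literal, skipped
  |
    (
        //.*                        # single-line comment
      |
        /\*(?:[^*]|\*(?!/))*\*/     # multi-line comment
    )
''', re.VERBOSE)


def find_comments(source):
    """Return the comments found in source, in order of appearance."""
    # Keep only the matches in which the comment group took part.
    return [m.group(1) for m in FIND_COMMENTS.finditer(source) if m.group(1)]


# The file from the question, embedded as a string for the example.
sample = '''/*My function
is great.*/
int j = 0//hello world
void foo(){
    //tricky example
    cout << "This // is // not a comment\\n";
}
'''

for comment in find_comments(sample):
    print(comment)
```

Using finditer with m.group(1) avoids having to remember which tuple index holds the comment, at the cost of building the list by hand.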

These changes do not affect the pattern's efficiency, since the main work for the regex engine is to find positions that begin with a quote or a slash. This task is simplified by the lookahead (?=["'/]) at the beginning of the pattern, which allows internal optimizations to quickly find the first character.

Another optimization is the use of emulated atomic groups, which reduce backtracking to a minimum and allow greedy quantifiers inside repeated groups.

NB: luckily, there is no heredoc syntax in C!
