Regex for matching C++ string constant


Question

I'm currently working on a C++ preprocessor and I need to match string constants with more than 0 characters, like "hey I'm a string". I'm currently working with this one: \"([^\\\"]+|\\.)+\" but it fails on one of my test cases.

Test cases:

std::cout << "hello" << " world";
std::cout << "He said: \"bananas\"" << "...";
std::cout << "";
std::cout << "\x12\23\x34";

Expected output:

std::cout << String("hello") << String(" world");
std::cout << String("He said: \"bananas\"") << String("...");
std::cout << "";
std::cout << String("\x12\23\x34");

For the second one, I get:

std::cout << String("He said: \")bananas\"String(" << ")...";

Short repro code (using the regex by AR.3):

std::string in_line = "std::cout << \"He said: \\\"bananas\\\"\" << \"...\";";
std::regex r("\"([^\"]+|\\.|(?<=\\\\)\")+\"");
in_line = std::regex_replace(in_line, r, "String($&)");


Recommended answer

Lexing a source file is a good job for regexes. But for such a task, let's use a better regex engine than std::regex. Let's use PCRE (or boost::regex) at first. At the end of this post, I'll show what you can do with a less feature-packed engine.

We only need to do partial lexing, ignoring all unrecognized tokens that won't affect string literals. What we need to handle is:


  • Singleline comments
  • Multiline comments
  • Character literals
  • String literals

We'll be using the extended (x) option, which ignores whitespace in the pattern.

Here's what [lex.comment] says:

The characters /* start a comment, which terminates with the characters */. These comments do not nest. The characters // start a comment, which terminates immediately before the next new-line character. If there is a form-feed or a vertical-tab character in such a comment, only white-space characters shall appear between it and the new-line that terminates the comment; no diagnostic is required. [ Note: The comment characters //, /*, and */ have no special meaning within a // comment and are treated just like other characters. Similarly, the comment characters // and /* have no special meaning within a /* comment. — end note ]



# singleline comment
// .* (*SKIP)(*FAIL)

# multiline comment
| /\* (?s: .*? ) \*/ (*SKIP)(*FAIL)

Easy peasy. If you match anything there, just (*SKIP)(*FAIL) - meaning that you throw away the match. The (?s: .*? ) applies the s (singleline) modifier to the . metacharacter, meaning it's allowed to match newlines.

Here's the grammar from [lex.ccon]:


  character-literal:
    encoding-prefix(opt) ' c-char-sequence '
  encoding-prefix:
    one of u8 u U L
  c-char-sequence:
    c-char
    c-char-sequence c-char
  c-char:
    any member of the source character set except the single-quote ', backslash \, or new-line character
    escape-sequence
    universal-character-name
  escape-sequence:
    simple-escape-sequence
    octal-escape-sequence
    hexadecimal-escape-sequence
  simple-escape-sequence: one of \' \" \? \\ \a \b \f \n \r \t \v
  octal-escape-sequence:
    \ octal-digit
    \ octal-digit octal-digit
    \ octal-digit octal-digit octal-digit
  hexadecimal-escape-sequence:
    \x hexadecimal-digit
    hexadecimal-escape-sequence hexadecimal-digit


Let's define a few things first, which we'll need later on:

(?(DEFINE)
  (?<prefix> (?:u8?|U|L)? )
  (?<escape> \\ (?:
    ['"?\\abfnrtv]         # simple escape
    | [0-7]{1,3}           # octal escape
    | x [0-9a-fA-F]{1,2}   # hex escape
    | u [0-9a-fA-F]{4}     # universal character name
    | U [0-9a-fA-F]{8}     # universal character name
  ))
)




  • prefix is defined as an optional u8, u, U or L
  • escape is defined as per the standard, except that I've merged universal-character-name into it for the sake of simplicity

Once we have these, a character literal is pretty simple:

      (?&prefix) ' (?> (?&escape) | [^'\\\r\n]+ )+ ' (*SKIP)(*FAIL)
      

We throw it away as well with (*SKIP)(*FAIL).

String literals are defined in almost the same way as character literals. Here's a part of [lex.string]:


        string-literal:
          encoding-prefix(opt) " s-char-sequence(opt) "
          encoding-prefix(opt) R raw-string
        s-char-sequence:
          s-char
          s-char-sequence s-char
        s-char:
          any member of the source character set except the double-quote ", backslash \, or new-line character
          escape-sequence
          universal-character-name
      


This mirrors the character literal pattern:

      (?&prefix) " (?> (?&escape) | [^"\\\r\n]+ )* "
      

The differences are:

  • The character sequence is optional this time (* instead of +)
  • The double quote is disallowed when unescaped instead of the single quote
  • We actually don't throw it away :)
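
As a rough illustration of the string-literal alternative on its own, here is a minimal sketch that uses std::regex (ECMAScript grammar) rather than PCRE. The function name wrap_string_literals is hypothetical, escapes are simplified to "backslash plus any character", and the input is assumed to contain no comments, character literals or raw strings.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: wraps every ordinary string literal in String(...).
// Assumption: no comments, character literals or raw strings in the input;
// those need the extra alternatives described in the rest of the answer.
std::string wrap_string_literals(const std::string& line)
{
    // "  ( escape | any character except ", \ and line breaks )*  "
    static const std::regex literal(R"re("(?:\\.|[^"\\\r\n])*")re");
    // $& expands to the whole match in the default (ECMAScript) format.
    return std::regex_replace(line, literal, "String($&)");
}
```

Because the escape alternative consumes the backslash together with the character after it, an escaped quote can never be mistaken for the closing quote. Note that with * the empty literal "" is wrapped as well.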

Here's the raw string part:


        raw-string:
          " d-char-sequence(opt) ( r-char-sequence(opt) ) d-char-sequence(opt) "
        r-char-sequence:
          r-char
          r-char-sequence r-char
        r-char:
          any member of the source character set, except a right parenthesis )
          followed by the initial d-char-sequence (which may be empty) followed by a double quote ".
        d-char-sequence:
          d-char
          d-char-sequence d-char
        d-char:
          any member of the basic source character set except:
          space, the left parenthesis (, the right parenthesis ), the backslash \,
          and the control characters representing horizontal tab,
          vertical tab, form feed, and newline.
      


The regex for this is:

      (?&prefix) R " (?<delimiter>[^ ()\\\t\x0B\r\n]*) \( (?s:.*?) \) \k<delimiter> "
      




  • [^ ()\\\t\x0B\r\n]* is the set of characters that are allowed in delimiters (d-char)
  • \k<delimiter> refers to the previously matched delimiter
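
The delimiter backreference can be tried in isolation. The following is a sketch only, using std::regex instead of PCRE: the named group becomes a numbered group (\1), and [\s\S] stands in for (?s:.) because std::regex has no inline modifiers. The function name first_raw_string is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the first raw string literal found in
// `source`, or an empty string if there is none.
std::string first_raw_string(const std::string& source)
{
    // prefix? R" delimiter ( anything, lazily ) same-delimiter "
    static const std::regex raw(
        R"re((?:u8?|U|L)?R"([^ ()\\\t\x0B\r\n]*)\([\s\S]*?\)\1")re");
    std::smatch m;
    return std::regex_search(source, m, raw) ? m.str() : std::string();
}
```

The lazy [\s\S]*? stops at the first closing parenthesis that is followed by the captured delimiter and a double quote, so a )" sequence inside the literal does not end it unless the delimiter is empty.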
The full pattern is:

          (?(DEFINE)
            (?<prefix> (?:u8?|U|L)? )
            (?<escape> \\ (?:
              ['"?\\abfnrtv]         # simple escape
              | [0-7]{1,3}           # octal escape
              | x [0-9a-fA-F]{1,2}   # hex escape
              | u [0-9a-fA-F]{4}     # universal character name
              | U [0-9a-fA-F]{8}     # universal character name
            ))
          )
          
          # singleline comment
          // .* (*SKIP)(*FAIL)
          
          # multiline comment
          | /\* (?s: .*? ) \*/ (*SKIP)(*FAIL)
          
          # character literal
          | (?&prefix) ' (?> (?&escape) | [^'\\\r\n]+ )+ ' (*SKIP)(*FAIL)
          
          # standard string
          | (?&prefix) " (?> (?&escape) | [^"\\\r\n]+ )* "
          
          # raw string
          | (?&prefix) R " (?<delimiter>[^ ()\\\t\x0B\r\n]*) \( (?s:.*?) \) \k<delimiter> "
          

See the demo here.

Here's a simple demo program using boost::regex:

          #include <string>
          #include <iostream>
          #include <boost/regex.hpp>
          
          static void test()
          {
              boost::regex re(R"regex(
                  (?(DEFINE)
                    (?<prefix> (?:u8?|U|L) )
                    (?<escape> \\ (?:
                      ['"?\\abfnrtv]         # simple escape
                      | [0-7]{1,3}           # octal escape
                      | x [0-9a-fA-F]{1,2}   # hex escape
                      | u [0-9a-fA-F]{4}     # universal character name
                      | U [0-9a-fA-F]{8}     # universal character name
                    ))
                  )
          
                  # singleline comment
                  // .* (*SKIP)(*FAIL)
          
                  # multiline comment
                  | /\* (?s: .*? ) \*/ (*SKIP)(*FAIL)
          
                  # character literal
                  | (?&prefix)? ' (?> (?&escape) | [^'\\\r\n]+ )+ ' (*SKIP)(*FAIL)
          
                  # standard string
                  | (?&prefix)? " (?> (?&escape) | [^"\\\r\n]+ )* "
          
                  # raw string
                  | (?&prefix)? R " (?<delimiter>[^ ()\\\t\x0B\r\n]*) \( (?s:.*?) \) \k<delimiter> "
              )regex", boost::regex::perl | boost::regex::no_mod_s | boost::regex::mod_x | boost::regex::optimize);
          
              std::string subject(R"subject(
          std::cout << L"hello" << " world";
          std::cout << "He said: \"bananas\"" << "...";
          std::cout << "";
          std::cout << "\x12\23\x34";
          std::cout << u8R"hello(this"is\a\""""single\\(valid)"
          raw string literal)hello";
          
          "" // empty string
          '"' // character literal
          
          // this is "a string literal" in a comment
          /* this is
             "also inside"
             //a comment */
          
          // and this /*
          "is not in a comment"
          // */
          
          "this is a /* string */ with nested // comments"
              )subject");
          
              std::cout << boost::regex_replace(subject, re, "String\\($&\\)", boost::format_all) << std::endl;
          }
          
          int main(int argc, char **argv)
          {
              try
              {
                  test();
              }
              catch(const std::exception& ex) // catch by reference to avoid slicing
              {
                  std::cerr << ex.what() << std::endl;
              }
          
              return 0;
          }
          

(I left syntax highlighting disabled because it goes nuts on this code.)

For some reason, I had to take the ? quantifier out of prefix (change (?<prefix> (?:u8?|U|L)? ) to (?<prefix> (?:u8?|U|L) ) and (?&prefix) to (?&prefix)?) to make the pattern work. I believe it's a bug in boost::regex, as both PCRE and Perl work just fine with the original pattern.

Note that while this pattern technically uses recursion, it never nests recursive calls. Recursion could be avoided by inlining the relevant reusable parts into the main pattern.

A couple of other constructs can be avoided at the price of reduced performance. We can safely replace the atomic groups (?>...) with normal groups (?:...), as long as we don't nest quantifiers, which would risk catastrophic backtracking.

We can also avoid (*SKIP)(*FAIL) if we add one line of logic into the replacement function: all the alternatives to skip are grouped in a capturing group. If the capturing group matched, just ignore the match. If not, then it's a string literal.
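
The capturing-group trick also works with std::regex, which has neither (*SKIP)(*FAIL) nor a callback form of regex_replace. The sketch below walks the matches by hand with std::sregex_iterator. It is an illustration under simplifying assumptions: the name wrap_outside_comments is hypothetical, escapes are reduced to "backslash plus any character", and encoding prefixes and raw strings are left out.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: wraps string literals in String(...), leaving
// comments and character literals untouched.
std::string wrap_outside_comments(const std::string& source)
{
    // Group 1 collects everything to skip (comments, character literals);
    // the last alternative, outside the group, is the string literal.
    static const std::regex token(
        R"re((//.*|/\*[\s\S]*?\*/|'(?:\\.|[^'\\\r\n])+')|"(?:\\.|[^"\\\r\n])*")re");

    std::string out;
    auto last = source.cbegin();
    for (std::sregex_iterator it(source.cbegin(), source.cend(), token), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        out.append(last, m[0].first);            // text between tokens
        if (m[1].matched)
            out += m.str();                      // skipped token: copy as is
        else
            out += "String(" + m.str() + ")";    // string literal: wrap it
        last = m[0].second;
    }
    out.append(last, source.cend());             // trailing text
    return out;
}
```

Matching the comments and character literals, rather than ignoring them, is what keeps the quotes inside them from being seen as the start of a string literal.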

All of this means we can implement this in JavaScript, which has one of the simplest regex engines you can find, at the price of breaking the DRY rule and making the pattern illegible. The regex becomes this monstrosity once converted:

          (\/\/.*|\/\*[\s\S]*?\*\/|(?:u8?|U|L)?'(?:\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})|[^'\\\r\n])+')|(?:u8?|U|L)?"(?:\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})|[^"\\\r\n])*"|(?:u8?|U|L)?R"([^ ()\\\t\x0B\r\n]*)\([\s\S]*?\)\2"
          

And here's an interactive demo you can play with:

          function run() {
              var re = /(\/\/.*|\/\*[\s\S]*?\*\/|(?:u8?|U|L)?'(?:\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})|[^'\\\r\n])+')|(?:u8?|U|L)?"(?:\\(?:['"?\\abfnrtv]|[0-7]{1,3}|x[0-9a-fA-F]{1,2}|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})|[^"\\\r\n])*"|(?:u8?|U|L)?R"([^ ()\\\t\x0B\r\n]*)\([\s\S]*?\)\2"/g;
              
              var input = document.getElementById("input").value;
              var output = input.replace(re, function(m, ignore) {
                  return ignore ? m : "String(" + m + ")";
              });
              document.getElementById("output").innerText = output;
          }
          
          document.getElementById("input").addEventListener("input", run);
          run();

          <h2>Input:</h2>
          <textarea id="input" style="width: 100%; height: 50px;">
          std::cout << L"hello" << " world";
          std::cout << "He said: \"bananas\"" << "...";
          std::cout << "";
          std::cout << "\x12\23\x34";
          std::cout << u8R"hello(this"is\a\""""single\\(valid)"
          raw string literal)hello";
          
          "" // empty string
          '"' // character literal
          
          // this is "a string literal" in a comment
          /* this is
             "also inside"
             //a comment */
          
          // and this /*
          "is not in a comment"
          // */
          
          "this is a /* string */ with nested // comments"
          </textarea>
          <h2>Output:</h2>
          <pre id="output"></pre>
