Regex tokenize issue


Question

I have strings entered by the user and want to tokenize them. I want to use a regex for this, but I have run into a problem with one special case. An example string is

    Test + "Hello" + "Good\"more" + "Escape\"This\"Test"

or the C# equivalent

    @"Test + ""Hello"" + ""Good\""more"" + ""Escape\""This\""Test"""

I am able to match the Test and + tokens, but not the ones enclosed in ". I use " to let the user mark text as a literal string rather than a special token. If the user wants to use the " character inside the string, I thought of letting them escape it with a \.

So the rule would be: give me everything between two " characters, but the character in front of the closing " must not be a \.

The results I expect are: "Hello" "Good\"more" "Escape\"This\"Test". I need the " characters to be part of the final match so I know that the token is a string.

I currently have the regex @"""([\w]*)(?<!\\"")""" which gives me the following results: "Hello" "more" "Test"

So the lookbehind isn't working the way I want it to. Does anyone know the correct way to get the strings I want?
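For reference, the behaviour described above can be reproduced with a small sketch. The class and method names (`LookbehindAttempt`, `Find`) are made up for illustration. The likely cause is that `[\w]*` can never consume a `\` or a `"`, so a match can only start at the last quote before a run of word characters, and the lookbehind never gets a chance to help.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class LookbehindAttempt
{
    // The pattern from the question: quote, word characters, lookbehind, quote.
    private const string Pattern = @"""([\w]*)(?<!\\"")""";

    // Returns every full match of the attempted pattern (hypothetical helper).
    public static List<string> Find(string input) =>
        Regex.Matches(input, Pattern).Cast<Match>().Select(m => m.Value).ToList();

    public static void Main()
    {
        var s = @"Test + ""Hello"" + ""Good\""more"" + ""Escape\""This\""Test""";
        // Prints: "Hello" | "more" | "Test"
        Console.WriteLine(string.Join(" | ", Find(s)));
    }
}
```

Because `\w` excludes both the backslash and the quote, `"Good` cannot be extended past `\`, and the engine restarts at the escaped quote, which yields `"more"`.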

Recommended answer

To make it safer, I'd suggest getting all the substrings within unescaped pairs of "..." with the following regex:

^(?:[^"\\]*(?:\\.[^"\\]*)*("[^"\\]*(?:\\.[^"\\]*)*"))+

It matches:

  • ^ - start of string (so that every " and every escape sequence is checked from the beginning)
  • (?: - start of a non-capturing group that serves as a container for the subsequent subpatterns
    • [^"\\]*(?:\\.[^"\\]*)* - matches 0+ characters other than " and \, followed by 0+ sequences of \\. (any escape sequence), each followed by 0+ characters other than " and \ (thus we avoid treating an escaped " as the start of a literal, and the literal can be preceded by any number of escape sequences)
    • ("[^"\\]*(?:\\.[^"\\]*)*") - capture group 1, matching "..." substrings that may contain any escape sequences inside
  • )+ - end of the non-capturing group, repeated 1 or more times

See the regex demo. Here is a C# demo:

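    using System;
    using System.Linq;
    using System.Text.RegularExpressions;

    // Anchored pattern: group 1 captures each "..." literal in turn.
    var rx = "^(?:[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"))+";
    var s = @"Test + ""Hello"" + ""Good\""more"" + \""Escape\""This\""Test\"" + ""f""";
    // There is a single overall match; the repeated group 1 keeps every capture in .Captures.
    var matches = Regex.Matches(s, rx)
            .Cast<Match>()
            .SelectMany(m => m.Groups[1].Captures.Cast<Capture>().Select(p => p.Value))
            .ToList();
    Console.WriteLine(string.Join("\n", matches));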
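The role of the leading ^ can be illustrated by comparing the anchored pattern with a plain, unanchored scan for quoted literals. This is only a sketch: the names `QuotedLiteralDemo`, `Unanchored` and `Anchored`, and the sample input, are assumptions introduced here and are not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class QuotedLiteralDemo
{
    // One quoted literal: opening quote, plain characters or
    // backslash-escape pairs, then the closing quote.
    private const string Literal = "\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"";

    // Text allowed outside a literal, including escaped quotes.
    private const string Outside = "[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*";

    // Unanchored scan: any quote may start a match, even an escaped one.
    public static List<string> Unanchored(string input) =>
        Regex.Matches(input, Literal).Cast<Match>().Select(m => m.Value).ToList();

    // Anchored scan: walks from the start of the string, so escape
    // sequences outside literals are consumed before a literal may begin.
    public static List<string> Anchored(string input)
    {
        var m = Regex.Match(input, "^(?:" + Outside + "(" + Literal + "))+");
        return m.Groups[1].Captures.Cast<Capture>().Select(c => c.Value).ToList();
    }

    public static void Main()
    {
        // Sample input: Test + "Hello" + \"Escape\" + "f"
        var s = @"Test + ""Hello"" + \""Escape\"" + ""f""";
        // Prints: "Hello" | "Escape\" + "
        Console.WriteLine(string.Join(" | ", Unanchored(s)));
        // Prints: "Hello" | "f"
        Console.WriteLine(string.Join(" | ", Anchored(s)));
    }
}
```

In the unanchored run, the escaped quote in front of Escape is mistaken for the start of a literal, which shifts every later quote pairing. The anchored version consumes `\"` as an escape sequence first, which appears to be why the answer calls it the safer option.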
      



Update

If you need to remove the tokens, just match and capture everything outside of them with this code:

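    // Text outside quoted literals: plain characters and escape sequences.
    var keep = "[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*";
    // Each iteration stores the outside text in "keep" and skips one "..." literal;
    // the final (?<keep>...)$ picks up whatever follows the last literal.
    var rx = string.Format("^(?:(?<keep>{0})\"{0}\")+(?<keep>{0})$", keep);
    var s = @"Test + ""Hello"" + ""Good\""more"" + \""Escape\""This\""Test\"" + ""f""";
    var matches = Regex.Matches(s, rx)
            .Cast<Match>()
            .SelectMany(m => m.Groups["keep"].Captures.Cast<Capture>().Select(p => p.Value))
            .ToList();
    // Concatenate the kept pieces to get the input with the literals removed.
    Console.WriteLine(string.Join("", matches));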
      

See another demo.

Output: Test +  +  + \"Escape\"This\"Test\" +  for the input @"Test + ""Hello"" + ""Good\""more"" + \""Escape\""This\""Test\"" + ""f""";
