Regex tokenize issue

Views: 134  Posted: 2016/10/8 22:41:23  Tags: c# regex tokenize


Question


I have strings input by the user and want to tokenize them. For that, I want to use regex and now have a problem with a special case. An example string is

    Test + "Hello" + "Good\"more" + "Escape\"This\"Test"

or the C# equivalent

    @"Test + ""Hello"" + ""Good\""more"" + ""Escape\""This\""Test"""


I am able to match the Test and + tokens, but not the ones contained by the ". I use the " to let the user specify that this is literally a string and not a special token. Now if the user wants to use the " character in the string, I thought of allowing him to escape it with a \.


So the rule would be: Give me everything between two " ", but the character in front of the last " can not be a \.


The results I expect are: "Hello" "Good\"more" "Escape\"This\"Test". I need the " " characters to be in the final match so I know that this is a string.


I currently have the regex @"""([\w]*)(?<!\\"")""" which gives me the following results: "Hello" "more" "Test"


So the look behind isn't working as I want it to be. Does anyone know the correct way to get the string like I want?
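For reference, the behaviour described in the question can be reproduced with the short sketch below. The names LookbehindDemo and Find are made up for illustration and are not part of the original post. It suggests the lookbehind is not really the culprit: [\w]* cannot match a \ or a ", so the pattern only succeeds where the text between two quotes consists purely of word characters.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class LookbehindDemo
{
    // The pattern from the question as a verbatim string: "([\w]*)(?<!\\")"
    public const string Pattern = @"""([\w]*)(?<!\\"")""";

    // Returns every full match of the question's pattern in the input.
    public static string[] Find(string input) =>
        Regex.Matches(input, Pattern)
             .Cast<Match>()
             .Select(m => m.Value)
             .ToArray();

    public static void Main()
    {
        // The example string from the question.
        var s = @"Test + ""Hello"" + ""Good\""more"" + ""Escape\""This\""Test""";

        foreach (var token in Find(s))
            Console.WriteLine(token);
    }
}
```

Run on the example string, this prints "Hello", "more" and "Test", which is what the question reports. "Good and "Escape never match because the character after the word run is a \ rather than a closing ".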

Answer


To make it safer, I'd suggest getting all the substrings within unescaped pairs of "..." with the following regex:

^(?:[^"\\]*(?:\\.[^"\\]*)*("[^"\\]*(?:\\.[^"\\]*)*"))+

It matches:

^ - start of string (so that we can check each " and escape sequence)
(?: - non-capturing group 1 serving as a container for the subsequent subpatterns
[^"\\]*(?:\\.[^"\\]*)* - matches 0+ characters other than " and \, followed by 0+ sequences of \\. (any escape sequence), each followed by 0+ characters other than " and \ (thus we avoid matching the first " that is escaped, and it can be preceded by any number of escape sequences)
("[^"\\]*(?:\\.[^"\\]*)*") - capture group 1, matching "..." substrings that may contain any escape sequences inside
)+ - end of the non-capturing group, repeated one or more times

See the regex demo. Here is a C# demo:

    // Requires: using System; using System.Linq; using System.Text.RegularExpressions;
    var rx = "^(?:[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"))+";
    var s = @"Test + ""Hello"" + ""Good\""more"" + \""Escape\""This\""Test\"" + ""f""";
    // The anchored pattern yields one match; each "..." token is a capture of group 1.
    var matches = Regex.Matches(s, rx)
        .Cast<Match>()
        .SelectMany(m => m.Groups[1].Captures.Cast<Capture>().Select(p => p.Value).ToArray())
        .ToList();
    Console.WriteLine(string.Join("\n", matches));

Update

If you need to remove the tokens, just match and capture everything outside of them with this code:

    // "keep" is the sub-pattern for text that is not part of an unescaped "..." token.
    var keep = "[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*";
    var rx = string.Format("^(?:(?<keep>{0})\"{0}\")+(?<keep>{0})$", keep);
    var s = @"Test + ""Hello"" + ""Good\""more"" + \""Escape\""This\""Test\"" + ""f""";
    // Collect every capture of the named group "keep" and join them back together.
    var matches = Regex.Matches(s, rx)
        .Cast<Match>()
        .SelectMany(m => m.Groups["keep"].Captures.Cast<Capture>().Select(p => p.Value).ToArray())
        .ToList();
    Console.WriteLine(string.Join("", matches));

See another demo. Output:

    Test +  +  + \"Escape\"This\"Test\" + 

for the input

    @"Test + ""Hello"" + ""Good\""more"" + \""Escape\""This\""Test\"" + ""f"""
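As a rough check, the answer's pattern can also be run against the example string from the question, which the demos above do not do. The sketch below is only an illustration under assumptions: QuotedTokens, Extract and Body are hypothetical names, and the pattern is assembled from its repeated sub-pattern rather than written out in full.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class QuotedTokens
{
    // Ordinary characters interleaved with escape sequences: [^"\\]*(?:\\.[^"\\]*)*
    private const string Body = @"[^""\\]*(?:\\.[^""\\]*)*";

    // Same shape as the answer's pattern: skip unquoted text, then capture one "..." token.
    private static readonly Regex Rx =
        new Regex("^(?:" + Body + "(\"" + Body + "\"))+");

    // Returns every unescaped "..." token, quotes included, in order of appearance.
    public static string[] Extract(string input)
    {
        var m = Rx.Match(input);
        if (!m.Success)
            return new string[0];

        // Group 1 sits inside a repeated group, so all tokens are in its Captures.
        return m.Groups[1].Captures.Cast<Capture>().Select(c => c.Value).ToArray();
    }

    public static void Main()
    {
        // The example string from the question.
        var s = @"Test + ""Hello"" + ""Good\""more"" + ""Escape\""This\""Test""";

        foreach (var token in Extract(s))
            Console.WriteLine(token);
    }
}
```

On the question's string this prints "Hello", "Good\"more" and "Escape\"This\"Test", the three results the asker expected. Because the pattern is anchored at the start of the string, it produces a single match, and the individual tokens have to be read from the Captures collection of group 1 rather than from separate matches.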


                    