Regex tokenize issue
Problem description
I have strings input by the user and want to tokenize them. For that, I want to use a regex, and I now have a problem with a special case. An example string is

Test + "Hello" + "Good\"more" + "Escape\"This\"Test"

or the C# equivalent

@"Test + ""Hello"" + ""Good\""more"" + ""Escape\""This\""Test"""
I am able to match the Test and + tokens, but not the ones enclosed in ". I use the " to let the user specify that this is literally a string and not a special token. If the user wants to use the " character inside the string, I thought of letting them escape it with a \.
So the rule would be: give me everything between two " characters, but the character in front of the closing " cannot be a \.
The results I expect are:

"Hello"
"Good\"more"
"Escape\"This\"Test"

I need the " characters to be in the final match so I know that this is a string.
I currently have the regex @"""([\w]*)(?<!\\"")""", which gives me the following results:

"Hello"
"more"
"Test"
So the lookbehind isn't working the way I want it to. Does anyone know the correct way to get the strings I want?
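The results reported above can be reproduced with a short sketch. It suggests the cause: [\w]* cannot consume a \ or a ", so a string containing an escape never matches in full, and the lookbehind only inspects the characters just before the closing ", so nothing stops a match from starting at an escaped ". The class and method names (LookbehindRepro, QuotedMatches) are hypothetical and only serve this illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LookbehindRepro
{
    // Hypothetical helper: runs the pattern from the question, "([\w]*)(?<!\\")",
    // and returns the full text of every match.
    public static string[] QuotedMatches(string input)
    {
        var rx = new Regex(@"""([\w]*)(?<!\\"")""");
        return rx.Matches(input).Cast<Match>().Select(m => m.Value).ToArray();
    }

    public static void Main()
    {
        // The example string from the question.
        var s = @"Test + ""Hello"" + ""Good\""more"" + ""Escape\""This\""Test""";
        // Prints "Hello", "more" and "Test", one per line.
        Console.WriteLine(string.Join("\n", QuotedMatches(s)));
    }
}
```

The second and third matches start at an escaped ", which is why only the tails "more" and "Test" are returned.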
Recommended answer
To make it safer, I'd suggest getting all the substrings within unescaped pairs of "..." with the following regex:

^(?:[^"\\]*(?:\\.[^"\\]*)*("[^"\\]*(?:\\.[^"\\]*)*"))+
It matches:
- ^ : start of string (so that we can check each " and escape sequence)
- (?: : non-capturing group 1, serving as a container for the subsequent subpatterns
  - [^"\\]*(?:\\.[^"\\]*)* : matches 0+ characters other than " and \, followed by 0+ sequences of \\. (any escape sequence), each followed by 0+ characters other than " and \ (thus, we avoid matching the first " that is escaped, and it can be preceded by any number of escape sequences)
  - ("[^"\\]*(?:\\.[^"\\]*)*") : capture group 1, matching "..." substrings that may contain any escape sequences inside
- )+ : end of the non-capturing group, which repeats 1 or more times
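using System;
using System.Linq;
using System.Text.RegularExpressions;

var rx = "^(?:[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"))+";
var s = @"Test + ""Hello"" + ""Good\""more"" + \""Escape\""This\""Test\"" + ""f""";
// Group 1 is captured once per repetition of the outer group,
// so read every entry in Captures rather than only the last value.
var matches = Regex.Matches(s, rx)
    .Cast<Match>()
    .SelectMany(m => m.Groups[1].Captures.Cast<Capture>().Select(p => p.Value).ToArray())
    .ToList();
Console.WriteLine(string.Join("\n", matches));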
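For the tokenizing goal stated in the question, the same quoted-string subpattern could also serve as one alternative in a token pattern. The following is a minimal sketch, not the answer's method: it assumes the only other tokens are \w+ words and the + sign (the question does not give the full grammar), and the names TokenizerSketch and Tokenize are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TokenizerSketch
{
    // Assumed token grammar: quoted string (with \-escapes) | word | plus sign.
    // The quoted-string alternative reuses the subpattern from the answer.
    private static readonly Regex TokenRx = new Regex(
        @"""[^""\\]*(?:\\.[^""\\]*)*""|\w+|\+");

    // Returns the tokens in input order; characters that match no
    // alternative (such as spaces) are skipped.
    public static List<string> Tokenize(string input)
    {
        return TokenRx.Matches(input)
                      .Cast<Match>()
                      .Select(m => m.Value)
                      .ToList();
    }

    public static void Main()
    {
        // The example string from the question.
        var s = @"Test + ""Hello"" + ""Good\""more"" + ""Escape\""This\""Test""";
        foreach (var token in Tokenize(s))
            Console.WriteLine(token);
    }
}
```

For the question's example this yields seven tokens: Test, +, "Hello", +, "Good\"more", + and "Escape\"This\"Test". Because this pattern is not anchored at the start of the string, it assumes backslash escapes occur only inside quoted strings; an escaped " outside a string would be treated as an opening quote, which is the case the anchored regex above guards against.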
Update
If you need to remove the tokens, just match and capture everything outside of them with this code:
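using System;
using System.Linq;
using System.Text.RegularExpressions;

// Text that is not part of a quoted string (escape sequences included).
var keep = "[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*";
// Capture the text around each "..." token into the repeated group "keep".
var rx = string.Format("^(?:(?<keep>{0})\"{0}\")+(?<keep>{0})$", keep);
var s = @"Test + ""Hello"" + ""Good\""more"" + \""Escape\""This\""Test\"" + ""f""";
var matches = Regex.Matches(s, rx)
    .Cast<Match>()
    .SelectMany(m => m.Groups["keep"].Captures.Cast<Capture>().Select(p => p.Value).ToArray())
    .ToList();
Console.WriteLine(string.Join("", matches));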
Output:

Test +  +  + \"Escape\"This\"Test\" + 

for the input @"Test + ""Hello"" + ""Good\""more"" + \""Escape\""This\""Test\"" + ""f""".