Correct match using RegEx but it should work without substitution

Question

I have <autorpodpis>([^;,<\n\r]*?)\s*[;,<\n\r] to capture

<autorpodpis>_this_is_an_example_of_what_I'd_like_to_match<

If there is a space, a comma (,) or a semicolon (;), or a space before a comma or a semicolon, my RegEx catches everything, but the match includes these characters (see my link). It works as expected.

Overall, the RegEx works fine with the substitution \1 (or $1 in AutoHotKey, which I use). But I'd like to match without using substitution.
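To make the behaviour described in the question concrete, here is a minimal sketch in Python (chosen only so the example is easy to run; the question itself uses AutoHotKey). It applies the question's pattern unchanged and compares the whole match with the first capturing group. The variable names are illustrative.

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(r"<autorpodpis>([^;,<\n\r]*?)\s*[;,<\n\r]")

sample = "<autorpodpis>_this_is_an_example_of_what_I'd_like_to_match<"
m = pattern.search(sample)

whole_match = m.group(0)  # includes the opening tag and the trailing delimiter
wanted_part = m.group(1)  # only the text matched inside the parentheses

print(whole_match)
print(wanted_part)
```

The whole match carries the tag and the delimiter, which is why the desired text is only available through group 1 (what \1 or $1 refers to).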

Recommended answer

You seem to mix up the terms substitution (a regex-based replacement operation) and capturing (storing a part of the matched value, the part matched by a piece of the pattern enclosed in a pair of unescaped parentheses, in a numbered or named group).

If you want to just match a substring in specific context without capturing any subvalues, you might consider using lookarounds (lookbehind or lookahead).

In your case, since you need to match a string after some known string, you need a lookbehind. A lookbehind tells the regex engine to temporarily step backwards in the string, to check if the text inside the lookbehind can be matched there.

So, you can use

pos := RegExMatch(input, "(?<=<autorpodpis>)\p{L}+(?:\s+\p{L}+)*", Res)

So, Res should hold WOJCIECH ZAŁUSKA if you supply <autorpodpis>WOJCIECH ZAŁUSKA</autorpodpis> as input.

Explanation:

  • (?<=<autorpodpis>) - checks that <autorpodpis> appears right before the currently tested location. If it does not, the match fails at this location and the engine moves on to the next location in the string
  • \p{L}+ - 1 or more Unicode letters
  • (?:\s+\p{L}+)* - 0 or more sequences of 1 or more whitespace characters followed by 1 or more Unicode letters.

However, in most cases, and always in cases like this one, where the pattern inside the lookbehind is known, the lookbehind is unanchored (say, it is the first subpattern in the pattern), and you do not need overlapping matches, use capturing.

The version with capturing:

pos := RegExMatch(input, "<autorpodpis>(\p{L}+(?:\s+\p{L}+)*)", Res)

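And then Res1 will hold the WOJCIECH ZAŁUSKA value (in AutoHotkey v1's default output mode, the first capturing group is stored in the pseudo-array element Res1). In most cases (96%), capturing is faster.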
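For comparison, a minimal Python sketch of the capturing variant follows. As before, [^\W\d_] stands in for \p{L} (an assumption, since Python's built-in re module lacks \p{L}), and the names are illustrative.

```python
import re

# Approximation of \p{L} for Python's built-in re module (an assumption, not an exact equivalent).
LETTER = r"[^\W\d_]"

# The tag is matched literally; only the name is wrapped in a capturing group.
capture_pattern = re.compile(rf"<autorpodpis>({LETTER}+(?:\s+{LETTER}+)*)")

text = "<autorpodpis>WOJCIECH ZAŁUSKA</autorpodpis>"
m = capture_pattern.search(text)

full_match = m.group(0)  # tag plus name
name = m.group(1)        # just the captured name

print(full_match)
print(name)
```

No replacement operation is involved here: the value is read straight from the first group of the match.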

Now, your regex - <autorpodpis>([^;,<\n\r]*?)\s*[;,<\n\r] - is not efficient, because [^;,<\n\r] also matches what \s matches (spaces, for example), and \s also matches the \n and \r in [;,<\n\r]. My regex is linear: each subsequent subpattern cannot match what the previous one matches.
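The overlap between adjacent subpatterns can be checked directly. The sketch below, in Python with illustrative names, tests single characters against each piece; it only demonstrates the overlap and does not measure speed. [^\W\d_] again stands in for \p{L} as an assumption.

```python
import re

# Pieces of the question's pattern, tested one character at a time.
negated_class = re.compile(r"[^;,<\n\r]")
whitespace = re.compile(r"\s")
delimiter = re.compile(r"[;,<\n\r]")

# A space is accepted by both the lazy negated class and the \s* that follows it.
space_fits_both = bool(negated_class.fullmatch(" ")) and bool(whitespace.fullmatch(" "))

# A newline is accepted by both \s* and the closing delimiter class.
newline_fits_both = bool(whitespace.fullmatch("\n")) and bool(delimiter.fullmatch("\n"))

# In the answer's pattern, letters and whitespace do not overlap.
letter = re.compile(r"[^\W\d_]")  # stand-in for \p{L}
letter_overlaps_space = bool(letter.fullmatch(" ")) or bool(whitespace.fullmatch("A"))

print(space_fits_both, newline_fits_both, letter_overlaps_space)
```

When neighbouring subpatterns can match the same character, the engine may have to try several ways of splitting the text between them, which is the inefficiency described above.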
