Collapse and Capture a Repeating Pattern in a Single Regex Expression


Question

I keep bumping into situations where I need to capture a number of tokens from a string, and after countless tries I couldn't find a way to simplify the process.

So let's say we have:

start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end

This example has 8 items inside, but say it could have between 3 and 10 items.

Ideally, I'd like something like this:

start:(?:(\w+)-?){3,10}:end

It is nice and clean, BUT it only captures the last match.
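To make the "only captures the last match" behaviour concrete, here is a minimal sketch using Python's standard re module. Python is used purely for illustration and the variable names are my own; a quantified capturing group behaves the same way in PCRE.

```python
import re

s = "start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end"

# The capturing group sits inside a quantified non-capturing group,
# so every iteration overwrites what the previous iteration captured.
m = re.fullmatch(r"start:(?:(\w+)-?){3,10}:end", s)

# Only one group exists, and it holds the token from the final iteration.
last_only = m.groups()
print(last_only)
```

The string is accepted as a whole, but the earlier tokens are no longer reachable through the group.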

I usually use something like this in simple situations:

start:(\w+)-(\w+)-(\w+)-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?:end

That is 3 mandatory groups and another 7 optional ones because of the max-10 limit, but this doesn't look 'nice', and it would be a pain to write and keep track of if the max limit were 100 and the matches were more complex.
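If a little programming is acceptable (which the question would rather avoid), the long pattern does not have to be written by hand. The sketch below generates it in Python; build_pattern is a hypothetical helper of mine, and it emits a slightly stricter form, (?:-(\w+))? for each optional item, than the hand-written -?(\w+)? above.

```python
import re

def build_pattern(min_items, max_items, token=r"\w+"):
    """Build 'start:tok-tok-...:end' with one capturing group per item."""
    # Mandatory items, joined by the delimiter.
    required = "-".join(f"({token})" for _ in range(min_items))
    # Each optional item carries its own leading delimiter.
    optional = f"(?:-({token}))?" * (max_items - min_items)
    return f"start:{required}{optional}:end"

pattern = build_pattern(3, 10)
s = "start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end"
m = re.fullmatch(pattern, s)

# Optional groups that did not take part in the match are None.
tokens = [g for g in m.groups() if g is not None]
print(tokens)
```

This only hides the repetition from the person writing the pattern; the regex itself is still as long as before.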

And the best I could do so far:

start:(\w+)-((?1))-((?1))-?((?1))?-?((?1))?-?((?1))?-?((?1))?-?((?1))?:end

This is shorter, especially if the matches are complex, but it is still long.

Has anyone managed to make this work as a single regex-only solution, without programming?

I'm mostly interested in how this can be done in PCRE, but other flavors would be OK too.

The purpose is to validate a match and capture the individual tokens inside match 0 by regex alone, without any OS/software/programming-language limitation.

With @nhahtdh's help I got to the regex below by using \G:

(?:start:(?=(?:[\w]+(?:-|(?=:end))){3,10}:end)|(?!^)\G-)([\w]+)

This is even shorter, and it can be written without repeating code.

I'm also interested in the ECMA flavor. Since it doesn't support \G, I'm wondering if there's another way, especially without using the /g modifier.

Recommended answer

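Read this first!

This post is meant to show what is possible, not to endorse the "regex everything" approach to problem solving. The author wrote 3-4 variations, each with a subtle bug that was tricky to detect, before arriving at the current solution.

For your specific example, there are better and more maintainable solutions, such as matching first and then splitting the match along the delimiter.

This post deals with your specific example. I really doubt that a full generalization is possible, but the idea behind it can be reused in similar cases.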
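The "match, then split along the delimiter" alternative mentioned above can be sketched as follows. This is a minimal Python illustration under my own naming (FORMAT and extract_tokens are hypothetical), not code from the original answer.

```python
import re

# One capturing group around the whole run of 3 to 10 dash-separated words.
FORMAT = re.compile(r"start:(\w+(?:-\w+){2,9}):end")

def extract_tokens(s):
    """Return the list of tokens, or None if s is not in the expected format."""
    m = FORMAT.fullmatch(s)
    return m.group(1).split("-") if m else None

print(extract_tokens("start:test-test-lorem:end"))
```

The regex only validates the shape and the item count; the per-token extraction is left to an ordinary string split.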

  • .NET supports capturing a repeating pattern with the CaptureCollection class.
  • For languages that support \G and look-behind, we may be able to construct a regex that works with a global matching function. It is not easy to get it completely right, and it is easy to write a subtly buggy regex.
  • For languages without \G and look-behind support, it is possible to emulate \G with ^ by chomping the input string after each single match. (Not covered in this answer.)
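The answer does not cover the last option, so the following is only my own rough sketch of what "emulating \G with ^ by chomping" could look like. It uses Python, whose standard re module has no \G; all names are hypothetical, and it handles only the first start:...:end block in the input.

```python
import re

# First token: the look-ahead validates the whole run (3 to 10 items) once.
FIRST = re.compile(r"start:(?=\w+(?:-\w+){2,9}:end)(\w+)")
# Later tokens: ^ plays the role of \G on the already-chomped string.
NEXT = re.compile(r"^-(\w+)")

def tokens_by_chomping(s):
    m = FIRST.search(s)
    if m is None:
        return []
    found = [m.group(1)]
    rest = s[m.end():]            # chomp everything matched so far
    while (m := NEXT.match(rest)) is not None:
        found.append(m.group(1))
        rest = rest[m.end():]     # chomp again before the next attempt
    return found

print(tokens_by_chomping("xx start:a-b-c-d:end-zz"))
```

Because every follow-up match must begin at the start of the remaining string, the loop stops at the : of :end and never picks up anything after it.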

This solution assumes the regex engine supports the \G match boundary, look-ahead (?=pattern), and look-behind (?<=pattern). The Java, Perl, PCRE, .NET, and Ruby regex flavors support all of these features.

However, in .NET you can simply go with your own regex, since .NET exposes every instance matched by a repeated capturing group through the CaptureCollection class.

For your case, it can be done in one regex, using the \G match boundary plus a look-ahead to constrain the number of repetitions:

(?:start:(?=\w+(?:-\w+){2,9}:end)|(?<=-)\G)(\w+)(?:-|:end)

The construction is \w+- repeated, followed by \w+:end.

(?:start:(?=\w+(?:-\w+){2,9}:end)|(?!^)\G-)(\w+)

The construction is \w+ for the first item, then -\w+ repeated. (Thanks to ka ᵠ for the suggestion.) It is simpler to reason about the correctness of this construction, since there are fewer alternations.

The \G match boundary is especially useful when you need to do tokenization, where you must make sure the engine does not skip ahead and match stuff that should have been invalid.

Let us break down the regex:

(?:
  start:(?=\w+(?:-\w+){2,9}:end)
    |
  (?<=-)\G
)
(\w+)
(?:-|:end)

The easiest part to recognize is (\w+) in the line before last, which is the word that you want to capture.

The last line is also quite easy to recognize: the word to be matched may be followed by - or :end.

I allow the regex to freely start matching anywhere in the string. In other words, start:...:end can appear anywhere in the string, and any number of times; the regex will simply match all the words. You only need to process the array returned to separate where the matched tokens actually come from.

As for the explanation, the beginning of the regex checks for the presence of the string start:, and the look-ahead that follows checks that the number of words is within the specified limit and that the run ends with :end. Either that, or we check that the character just before the point where the previous match ended is a -, and continue from the previous match.

For the other construction:

(?:
  start:(?=\w+(?:-\w+){2,9}:end)
    |
  (?!^)\G-
)
(\w+)

Everything is almost the same, except that we match start:\w+ first and then the repetitions of the form -\w+. This contrasts with the first construction, where we match start:\w+- first and then the repeated instances of \w+- (or \w+:end for the last repetition).

It is quite tricky to make this regex work for matching in the middle of a string:

  • We need to check the number of words between start: and :end (as part of the requirement of the original regex).

  • \G also matches the beginning of the string! (?!^) is needed to prevent this behavior. Without it, the regex may produce a match when there isn't any start:.

    For the first construction, the look-behind (?<=-) already prevents this case ((?!^) is implied by (?<=-)).

  • For the first construction, (?:start:(?=\w+(?:-\w+){2,9}:end)|(?<=-)\G)(\w+)(?:-|:end), we need to make sure that we don't match anything funny after :end. The look-behind serves that purpose: it prevents any garbage after :end from matching.

    The second construction doesn't run into this problem, since we get stuck at the : (of :end) once we have matched all the tokens in between.

If you want to validate that the input string follows the format (no extra stuff in front or behind) and extract the data, you can add anchors as such:

(?:^start:(?=\w+(?:-\w+){2,9}:end$)|(?!^)\G-)(\w+)
(?:^start:(?=\w+(?:-\w+){2,9}:end$)|(?!^)\G)(\w+)(?:-|:end)

(The look-behind is no longer needed, but we still need (?!^) to prevent \G from matching the start of the string.)

For the general problem of capturing all instances of a repetition, I don't think there is a general way to modify the regex. One example of a "hard" (or impossible?) case to convert is when a repetition has to backtrack one or more iterations to fulfill a certain condition in order to match.

When the original regex describes the whole input string (validation type), it is usually easier to convert than a regex that tries to match from the middle of the string (matching type). However, you can always do a match with the original regex first, which turns a matching-type problem back into a validation-type problem.
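One way to read this two-step idea: locate each candidate block with the original regex, then treat every block as a small validation-type input. Below is a minimal Python sketch under that reading; BLOCK and token_lists are hypothetical names of mine.

```python
import re

# Original-style regex: finds a whole start:...:end block anywhere in the text.
BLOCK = re.compile(r"start:(\w+(?:-\w+){2,9}):end")

def token_lists(text):
    """Return one token list per well-formed block found in text."""
    # Step 1: matching-type search over the full text.
    # Step 2: each hit is already validated, so its body can simply be split.
    return [m.group(1).split("-") for m in BLOCK.finditer(text)]

print(token_lists("xx start:a-b-c:end yy start:d-e:end start:f-g-h-i:end"))
```

Blocks with too few or too many items are skipped, because the first-stage regex never matches them.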

We build such a regex by going through these steps:

  • Write a regex that covers the part before the repetition (e.g. start:). Let us call this the prefix regex.
  • Match and capture the first instance (e.g. (\w+)).
    (At this point, the first instance and delimiter should have been matched.)
  • Add \G as an alternation. Usually you also need to prevent it from matching the start of the string.
  • Add the delimiter, if any (e.g. -).
    (After this step, the rest of the tokens should also have been matched, except maybe the last.)
  • Add the part that covers what comes after the repetition, if necessary (e.g. :end). Let us call the part after the repetition the suffix regex (whether we add it to the construction doesn't matter).
  • Now the hard part. You need to check that:
    • There is no other way to start a match apart from the prefix regex. Take note of the \G branch.
    • There is no way to start any match after the suffix regex has been matched. Take note of how the \G branch starts a match.
    • For the first construction, if you mix the suffix regex (e.g. :end) with the delimiter (e.g. -) in an alternation, make sure you don't end up allowing the suffix regex as a delimiter.
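As a rough illustration of these steps for the second construction, the sketch below assembles the pattern string from its parts. It is written in Python only to glue the pieces together: Python's standard re module does not support \G, so the result is meant for an engine such as PCRE. build_g_regex and its parameters are hypothetical, and every argument is assumed to be an already valid regex fragment.

```python
def build_g_regex(prefix, item, delimiter, suffix, min_items, max_items):
    """Assemble a \\G-based pattern that captures one item per match."""
    # After the first item, between min-1 and max-1 further items may follow.
    repeats = f"{{{min_items - 1},{max_items - 1}}}"
    # Look-ahead placed right after the prefix: validates the whole run once.
    lookahead = f"(?={item}(?:{delimiter}{item}){repeats}{suffix})"
    # Branch 1 starts at the prefix; branch 2 continues from the previous
    # match via \G, with (?!^) keeping \G away from the start of the string.
    return f"(?:{prefix}{lookahead}|(?!^)\\G{delimiter})({item})"

pattern = build_g_regex("start:", r"\w+", "-", ":end", 3, 10)
print(pattern)
```

Generating the pattern does not replace the checks in the last step; those still have to be reasoned through for the specific prefix, delimiter, and suffix.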
