Regex: capturing groups within capture groups


Problem description

Intro

(you can skip to "What if..." if you get bored with intros)

This question is not specifically about VBScript (I just happened to use it in this case): I want to find a solution for general regular expression usage (editors included).

This started when I wanted to create an adaptation of Example 4 where 3 capture groups are used to split data across 3 cells in MS Excel. I needed to capture one entire pattern and then, within it, capture 3 other patterns. However, in the same expression, I also needed to capture another kind of pattern and again capture 3 other patterns within it (yeah I know... but before pointing the nutjob finger, please finish reading).

I first thought of named capturing groups, then realized that I should not «mix named and numbered capturing groups», since it «is not recommended because flavors are inconsistent in how the groups are numbered».

Then I looked into VBScript SubMatches and «non-capturing» groups and I got a working solution for a specific case:

' regEx must exist before the loop; Myrange is assumed to be a Range set elsewhere.
Set regEx = CreateObject("VBScript.RegExp")

For Each C In Myrange
    strPattern = "(?:^([0-9]+);([0-9]+);([0-9]+)$|^.*:([0-9]+)\s.*:([0-9]+).*:([a-zA-Z0-9]+)$)"

    If strPattern <> "" Then
        strInput = C.Value

        With regEx
            .Global = True
            .MultiLine = True
            .IgnoreCase = False
            .Pattern = strPattern
        End With

        Set rgxMatches = regEx.Execute(strInput)

        ' With no match the loop below never runs, so report it here.
        If rgxMatches.Count = 0 Then C.Offset(0, 1) = "(Not matched)"

        For Each mtx In rgxMatches
            If mtx.SubMatches(0) <> "" Then
                ' 1st parent pattern matched: sub-matches 0 to 2
                C.Offset(0, 1) = mtx.SubMatches(0)
                C.Offset(0, 2) = mtx.SubMatches(1)
                C.Offset(0, 3) = mtx.SubMatches(2)
            ElseIf mtx.SubMatches(3) <> "" Then
                ' 2nd parent pattern matched: sub-matches 3 to 5
                C.Offset(0, 1) = mtx.SubMatches(3)
                C.Offset(0, 2) = mtx.SubMatches(4)
                C.Offset(0, 3) = mtx.SubMatches(5)
            End If
        Next
    End If
Next

Here's a demo of the regex in Rubular. Given these inputs:

124;12;3
my id1:213 my id2:232 my word:ins4yanrgx
:8587459 :18254182540215 :dcpt
0;1;2

It returns the first 2 cells with numbers and the 3rd with a number or a word. Basically I used a non-capturing group with 2 "parent" patterns ("parents" = broad patterns where I want to detect other sub-patterns). If the 1st parent pattern has a matching sub-pattern (1st capture group) then I place its value and the remaining captured groups of this pattern in the 3 cells. If not, I check if the 4th capture group (belonging to the 2nd parent pattern) was matched and place the remaining sub-patterns in the same 3 cells.
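
For readers outside VBScript, the same selection logic can be sketched in Python. This is only an illustration under assumptions: it uses the standard `re` module, and `split_cells` is a hypothetical helper name that does not appear in the original code.

```python
import re

# Same two "parent" alternatives as the VBScript pattern above.
PATTERN = re.compile(
    r"(?:^([0-9]+);([0-9]+);([0-9]+)$"
    r"|^.*:([0-9]+)\s.*:([0-9]+).*:([a-zA-Z0-9]+)$)"
)


def split_cells(text):
    """Return the three cell values for one input line, or None if nothing matches."""
    m = PATTERN.search(text)
    if m is None:
        return None
    if m.group(1) is not None:   # 1st parent matched: groups 1 to 3
        return m.group(1, 2, 3)
    return m.group(4, 5, 6)      # otherwise the 2nd parent: groups 4 to 6


for line in ["124;12;3",
             "my id1:213 my id2:232 my word:ins4yanrgx",
             ":8587459 :18254182540215 :dcpt",
             "0;1;2"]:
    print(line, "->", split_cells(line))
```

Python reports a group that did not take part in the match as `None`, so the check is `is not None` rather than a comparison with an empty string.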

What if...

Instead of having something like this:

(?:^(\d+);(\d+);(\d+)$|^.*:(\d+)\s.*:(\d+).*:(\w+)$|what(ever))

Something like this could be possible:

(#:^(\d+);(\d+);(\d+)$)|(#:^.*:(\d+)\s.*:(\d+).*:(\w+)$)|(#:what(ever))

Here (#:, instead of creating a non-capturing group, would create a numbered "parent" capture group. That way I could do something similar to Example 4:

C.Offset(0, 1) = regEx.Replace(strInput, "#$1")
C.Offset(0, 2) = regEx.Replace(strInput, "#$2")
C.Offset(0, 3) = regEx.Replace(strInput, "#$3")

It would search parent patterns until it finds a match in a child pattern (the first match would be returned and, ideally, wouldn't search the remaining ones).
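
The behaviour described here (try each parent in turn, stop at the first one whose children match) can be approximated without any new syntax by keeping the parents as separate patterns. The sketch below is a hedged illustration in Python, not an existing regex feature; `PARENTS` and `first_parent_match` are hypothetical names.

```python
import re

# Each "parent" is compiled on its own, so its children are always groups 1..n.
PARENTS = [
    re.compile(r"^(\d+);(\d+);(\d+)$"),
    re.compile(r"^.*:(\d+)\s.*:(\d+).*:(\w+)$"),
    re.compile(r"what(ever)"),
]


def first_parent_match(text):
    """Try the parents in order; return (parent_number, child_groups) for the first hit."""
    for number, parent in enumerate(PARENTS, start=1):
        m = parent.search(text)
        if m is not None:
            return number, m.groups()   # later parents are never searched
    return None
```

Because numbering restarts for every parent, child 1 always means "the first sub-pattern of whichever parent matched", which is roughly what "#$1" is meant to express.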

Is there something like this already? Or am I missing something in regex that already allows this?

Other possible variations:

  • refer to the parent and child pattern directly, e.g.: #2$3 (this would be equivalent to $6 in my example);
  • create as many capturing groups as necessary within others (I guess this would be more complex, but it is also the most interesting part), e.g.: with a regex (same syntax) like (#:^_(?:(#:(\d+):\w+-(\d))|(#:\w+:(\d+)-(\d+)))_$)|(#:^\w+:\s+(#:(\w+);\d-(\d+))$) and fetching ##$1 in patterns like:

    _123:smt-4_ it would match in: 123
    _ott:432-10_ it would match in: 432
    yant: special;3-45235 it would match in: special
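
As a rough check of these three expected results, the nested example can be rewritten with ordinary syntax: the hypothetical (#: parents become non-capturing structure and only the children capture. This Python sketch assumes that "##$1" means "the first child group that took part in the match"; `NESTED` and `first_child` are made-up names.

```python
import re

# Ordinary syntax only: the (#: parents are dropped, the children still capture.
NESTED = re.compile(
    r"^_(?:(\d+):\w+-(\d)|\w+:(\d+)-(\d+))_$"
    r"|^\w+:\s+(\w+);\d-(\d+)$"
)


def first_child(text):
    """Return the first child group that took part in the match, or None."""
    m = NESTED.search(text)
    if m is None:
        return None
    return next((g for g in m.groups() if g is not None), None)
```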

Please tell me if you notice any mistakes or flaws in this logic; I will edit as soon as possible.

Solution

This usually comes up when mostly the same data is to be captured
and the only difference is in its form.

There is a regex construct for that, called branch reset.
It's offered on most Perl-compatible engines, but not in Java or .NET.
It mostly just saves regex resources and makes it easier to handle matches.

The alternative you mention will not help; it actually just uses
more resources. You still have to inspect what matched to see where you are,
although you only have to check one group within a cluster to tell which other
groups are valid (with branch reset, even that check is unnecessary).

(The listings below were constructed using RegexFormat 6.)

Here is the branch reset version:

 # (?|^(\d+);(\d+);(\d+)$|^.*:(\d+)\s.*:(\d+).*:(\w+)$|what(ever)()())

 (?|
      ^ 
      ( \d+ )                       # (1)
      ;
      ( \d+ )                       # (2)
      ;
      ( \d+ )                       # (3)
      $ 
   |  
      ^ .* :
      ( \d+ )                       # (1)
      \s .* :
      ( \d+ )                       # (2)
      .* :
      ( \w+ )                       # (3)
      $ 
   |  
      what
      ( ever )                      # (1)
      ( )                           # (2)
      ( )                           # (3)
 )
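
Python's built-in `re` module does not accept `(?|`, so the listing above cannot be run there directly. As a hedged sketch, the effect of branch reset can be imitated after the match by renumbering the groups that actually took part. The name `branch_reset_groups` is hypothetical, and the shortcut only works because every group inside a branch is mandatory.

```python
import re

# Plain alternation: the groups are numbered 1..7 across all branches.
FLAT = re.compile(
    r"(?:^(\d+);(\d+);(\d+)$|^.*:(\d+)\s.*:(\d+).*:(\w+)$|what(ever))"
)


def branch_reset_groups(text, width=3):
    """Renumber participating groups from 1, padding with '' like the empty ()() above."""
    m = FLAT.search(text)
    if m is None:
        return None
    taken = [g for g in m.groups() if g is not None]
    return tuple(taken + [""] * (width - len(taken)))
```

With this helper the caller always reads positions 1 to 3, whichever branch matched, which is the convenience branch reset provides natively.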

Here are your two regexes. Notice that the 'parent' capturing actually increases the number of groups (which slows down the engine):

 # (?:^(\d+);(\d+);(\d+)$|^.*:(\d+)\s.*:(\d+).*:(\w+)$|what(ever))

 (?:
      ^ 
      ( \d+ )                       # (1)
      ;
      ( \d+ )                       # (2)
      ;
      ( \d+ )                       # (3)
      $ 
   |  
      ^ .* :
      ( \d+ )                       # (4)
      \s .* :
      ( \d+ )                       # (5)
      .* :
      ( \w+ )                       # (6)
      $ 
   |  
      what
      ( ever )                      # (7)
 )

and

    # (#:^(\d+);(\d+);(\d+)$)|(#:^.*:(\d+)\s.*:(\d+).*:(\w+)$)|(#:what(ever))

    (                             # (1 start)
         \#: ^ 
         ( \d+ )                       # (2)
         ;
         ( \d+ )                       # (3)
         ;
         ( \d+ )                       # (4)
         $ 
    )                             # (1 end)
 |  
    (                             # (5 start)
         \#: ^ .* :
         ( \d+ )                       # (6)
         \s .* :
         ( \d+ )                       # (7)
         .* :
         ( \w+ )                       # (8)
         $ 
    )                             # (5 end)
 |  
    (                             # (9 start)
         \#:what
         ( ever )                      # (10)
    )                             # (9 end)
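
The group counts in the two listings can be confirmed mechanically. The snippet below is a small illustration using Python's `re` module, where the compiled pattern's `groups` attribute reports the number of capturing groups; the variable names are arbitrary.

```python
import re

FLAT = r"(?:^(\d+);(\d+);(\d+)$|^.*:(\d+)\s.*:(\d+).*:(\w+)$|what(ever))"
# As in the listing above, "#:" is just two literal characters here,
# so each outer pair of parentheses is an ordinary capturing group.
WITH_PARENTS = r"(#:^(\d+);(\d+);(\d+)$)|(#:^.*:(\d+)\s.*:(\d+).*:(\w+)$)|(#:what(ever))"

flat_count = re.compile(FLAT).groups
parent_count = re.compile(WITH_PARENTS).groups
print(flat_count, parent_count)
```

This prints 7 and 10, matching the highest group numbers annotated in the two listings.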
