python regex match more than once per index of search string

Question

I'm looking for a way to make the finditer function of the python re module, or the newer regex module, match all possible variations of a particular pattern, overlapping or otherwise. I am aware of using lookaheads to get matches without consuming the search string, but I still only get one match per index, where I could get more than one.

The regex I am using is something like this:

(?=A{2}[BA]{1,6}A{2})

So in the string:

AABAABBAA

it should be able to match:

AABAA AABAABBAA AABBAA

but currently it only matches the last two of these. I realise it is to do with the greediness of the [BA]{1,6}. Is there a way to make the regex match everything from the laziest to the greediest possible pattern?

Recommended answer

I realise it is to do with the greediness of the [BA]{1,6}. Is there a way to make the regex match everything from the laziest to the greediest possible pattern?

The problem is twofold.

1. Regex engines will only match once at a character position.
2. There is no regex construct between lazy and greedy;
   it's either one or the other.

Skipping problem 1 for the moment:

Problem 2:
With {1,6}, there could be 1, 2, 3, 4, 5 or 6 matches
of a construct (character) at a given position.

To solve that problem, you'd have to specify independent {1},{2},{3},{4},{5},{6}
as optional alternations at that position.
Clearly a range {1,6} is not going to work.
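
As a rough illustration of why the counts need to be tried independently, here is a minimal sketch, assuming Python's standard re module. It anchors at position 0 of the string from the question and tries one fixed-count pattern per value; the name variants_at_0 is only illustrative.

```python
import re

text = "AABAABBAA"

# Try each repetition count separately, anchored at position 0.
# One fixed-count pattern per value stands in for the {1,6} range.
variants_at_0 = {}
for n in range(1, 7):
    m = re.match(r"A{2}[BA]{%d}A{2}" % n, text)
    if m:
        variants_at_0[n] = m.group(0)

print(variants_at_0)  # {1: 'AABAA', 5: 'AABAABBAA'}
```

Two different counts succeed at the same position, which a single {1,6} range can never report together.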

As far as a range is concerned, it can be told to find the
minimum amount by adding the lazy modifier, as in {1,6}?
But this will only find the smallest amount it can, no more, no less.
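
A small sketch of the greedy and lazy ranges side by side, assuming Python's re module. A capture group is added inside each lookahead (it is not in the pattern as posted) so the matched text can be read back.

```python
import re

text = "AABAABBAA"

# Capture groups inside the lookaheads are an addition for inspection only.
greedy = re.compile(r"(?=(A{2}[BA]{1,6}A{2}))")
lazy = re.compile(r"(?=(A{2}[BA]{1,6}?A{2}))")

greedy_found = [(m.start(), m.group(1)) for m in greedy.finditer(text)]
lazy_found = [(m.start(), m.group(1)) for m in lazy.finditer(text)]

print(greedy_found)  # [(0, 'AABAABBAA'), (3, 'AABBAA')]
print(lazy_found)    # [(0, 'AABAA'), (3, 'AABBAA')]
```

Each modifier yields exactly one result per position: the longest or the shortest, never both.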

Lastly,

Problem 1:
When a regex engine matches, it always advances the current position forward
by an amount equal to the length of the last match.
In the case of a matched zero-length assertion, it artificially advances
the position one character forward.
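
The advancing behaviour can be observed with a short sketch, assuming Python's re.finditer; the sample strings are only illustrative.

```python
import re

text = "AABAA"

# A zero-length lookahead consumes nothing, yet each start position is
# reported at most once because the scan moves on after every match.
starts = [m.start() for m in re.finditer(r"(?=A)", text)]
print(starts)  # [0, 1, 3, 4]

# A consuming pattern resumes after the end of the previous match,
# so the overlapping "AA" candidates at 0 and 1 cannot both appear.
spans = [m.span() for m in re.finditer(r"AA", "AAA")]
print(spans)  # [(0, 2)]
```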

So, given these two problems, one can use these strengths/weaknesses to come
up with a workaround, and live with some side effects.

Workaround:
Put all the possible alternatives at a position as assertions to be analyzed. Each match at a position will contain a list of groups, each of which holds a variation.
So, if you've matched 3 variations out of 6 possible variant groups, the groups with values will be the variants.

If none of the groups have values, no variants were found at that position.
A match with no variants is possible because all of the assertions are optional.
To avoid unnecessarily matching at these specific positions, a final
conditional can be used to not report them (i.e., (?(1)|(?(2)|(?!))) etc.).
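
To see what the final conditional buys, here is a minimal sketch assuming Python's re module, which supports the (?(1)...|...) conditional syntax. It uses a deliberately tiny pattern, (?=(AA))?, rather than the full pattern from the question.

```python
import re

text = "AABAA"

# Optional lookahead alone: succeeds everywhere, even where group 1 is unset.
unfiltered = re.compile(r"(?=(AA))?")
# Same pattern plus a conditional: if group 1 is unset, (?!) forces failure.
filtered = re.compile(r"(?=(AA))?(?(1)|(?!))")

unfiltered_starts = [m.start() for m in unfiltered.finditer(text)]
filtered_starts = [m.start() for m in filtered.finditer(text)]

print(unfiltered_starts)  # [0, 1, 2, 3, 4, 5]
print(filtered_starts)    # [0, 3]
```

Without the conditional, the empty matches would have to be filtered out in the calling code instead.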

Let's use your range example.
We will use the conditional at the end to verify that a group matched,
but it could be done without it.
Note that using this range example causes an overlap with identical
values in the final match. This does not ensure unique matches at
a position (the example following this shows how to avoid this).

 # (?=(A{2}[BA]{1,6}?A{2}))?(?=(A{2}[BA]{1,6}A{2}))?(?(1)|(?(2)|(?!)))

 (?=
      (                             # (1 start)
           A{2}
           [BA]{1,6}? 
           A{2} 
      )                             # (1 end)
 )?
 (?=
      (                             # (2 start)
           A{2}
           [BA]{1,6} 
           A{2} 
      )                             # (2 end)
 )?
 (?(1)
   |  (?(2)
        |  (?!)
      )
 )

Output:

 **  Grp 1 -  ( pos 0 , len 5 ) 
AABAA  
 **  Grp 2 -  ( pos 0 , len 9 ) 
AABAABBAA  

-------------

 **  Grp 1 -  ( pos 3 , len 6 ) 
AABBAA  
 **  Grp 2 -  ( pos 3 , len 6 ) 
AABBAA  
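
The pattern and output above can be reproduced in Python along these lines. This is a sketch assuming the standard re module and the string from the question; the variable names are illustrative.

```python
import re

text = "AABAABBAA"

# Lazy variant in group 1, greedy variant in group 2, then the conditional
# that rejects positions where neither group captured anything.
pattern = re.compile(
    r"(?=(A{2}[BA]{1,6}?A{2}))?"
    r"(?=(A{2}[BA]{1,6}A{2}))?"
    r"(?(1)|(?(2)|(?!)))"
)

results = [(m.start(), m.groups()) for m in pattern.finditer(text)]
for start, groups in results:
    print(start, groups)
# 0 ('AABAA', 'AABAABBAA')
# 3 ('AABBAA', 'AABBAA')
```

At position 3 both groups hold the same value, which is the duplication mentioned in the note above.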


Same, but without the range problem.
Here, we explicitly define unique constructs.
Note the unique values at each position.

 # (?=(A{2}[BA]{1}A{2}))?(?=(A{2}[BA]{2}A{2}))?(?=(A{2}[BA]{3}A{2}))?(?=(A{2}[BA]{4}A{2}))?(?=(A{2}[BA]{5}A{2}))?(?=(A{2}[BA]{6}A{2}))?(?(1)|(?(2)|(?(3)|(?(4)|(?(5)|(?(6)|(?!)))))))

 (?=
      (                             # (1 start)
           A{2}
           [BA]{1} 
           A{2} 
      )                             # (1 end)
 )?
 (?=
      (                             # (2 start)
           A{2}
           [BA]{2} 
           A{2} 
      )                             # (2 end)
 )?
 (?=
      (                             # (3 start)
           A{2}
           [BA]{3} 
           A{2} 
      )                             # (3 end)
 )?
 (?=
      (                             # (4 start)
           A{2}
           [BA]{4} 
           A{2} 
      )                             # (4 end)
 )?
 (?=
      (                             # (5 start)
           A{2}
           [BA]{5} 
           A{2} 
      )                             # (5 end)
 )?
 (?=
      (                             # (6 start)
           A{2}
           [BA]{6} 
           A{2} 
      )                             # (6 end)
 )?

 (?(1)|(?(2)|(?(3)|(?(4)|(?(5)|(?(6)|(?!)))))))

Output:

 **  Grp 1 -  ( pos 0 , len 5 ) 
AABAA  
 **  Grp 2 -  NULL 
 **  Grp 3 -  NULL 
 **  Grp 4 -  NULL 
 **  Grp 5 -  ( pos 0 , len 9 ) 
AABAABBAA  
 **  Grp 6 -  NULL 

------------------

 **  Grp 1 -  NULL 
 **  Grp 2 -  ( pos 3 , len 6 ) 
AABBAA  
 **  Grp 3 -  NULL 
 **  Grp 4 -  NULL 
 **  Grp 5 -  NULL 
 **  Grp 6 -  NULL 

Finally, all you need to do on each match is grab the capture groups
that have values and put them into an array.
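
That last step might look like the following in Python. This is a sketch under the assumption that the standard re module is used; build_pattern and variants_by_position are hypothetical helper names, and the pattern is generated programmatically only to avoid typing the six lookaheads by hand.

```python
import re

def build_pattern(min_count=1, max_count=6):
    """Build one optional capturing lookahead per repetition count."""
    counts = range(min_count, max_count + 1)
    lookaheads = "".join(r"(?=(A{2}[BA]{%d}A{2}))?" % n for n in counts)
    # Nested conditionals: succeed if any group matched, else fail via (?!).
    guard = "(?!)"
    for group_number in range(len(counts), 0, -1):
        guard = "(?(%d)|%s)" % (group_number, guard)
    return re.compile(lookaheads + guard)

def variants_by_position(text):
    """Return (start, [variants]) for each position with at least one variant."""
    pattern = build_pattern()
    return [
        (m.start(), [g for g in m.groups() if g is not None])
        for m in pattern.finditer(text)
    ]

found = variants_by_position("AABAABBAA")
print(found)
# [(0, ['AABAA', 'AABAABBAA']), (3, ['AABBAA'])]
```

All three variants from the question are recovered, grouped by the position at which they start.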
