Python regex module not finding all matches even with overlapping = True


Problem description


I am using the regex module from PyPI, which supports overlapped matching.

I have the following code, in which I search a string A for a DNA pattern defined by a regular expression. I want to find every match of the pattern, including the overlapping ones. regex is missing one of the matches, and I have no idea how to fix it.

import regex as re
A = "GGGGAGAAGGGGGGCCTTCCTGGGTCCCCGAGAGTGCAGACATGCCTGGGTCCACAGCCACGGTTTGGG"
GQ_list = re.findall(r"[G]{3,6}[ACTG]{1,33}[G]{3,6}[ACTG]{1,33}[G]{3,6}[ACTG]{1,33}[G]{3,6}", A, overlapped=True)

GQ_list returns:

['GGGGAGAAGGGGGGCCTTCCTGGGTCCCCGAGAGTGCAGACATGCCTGGGTCCACAGCCACGGTTTGGG',
 'GGGAGAAGGGGGGCCTTCCTGGGTCCCCGAGAGTGCAGACATGCCTGGGTCCACAGCCACGGTTTGGG',
 'GGGGGGCCTTCCTGGGTCCCCGAGAGTGCAGACATGCCTGGGTCCACAGCCACGGTTTGGG',
 'GGGGGCCTTCCTGGGTCCCCGAGAGTGCAGACATGCCTGGGTCCACAGCCACGGTTTGGG',
 'GGGGCCTTCCTGGGTCCCCGAGAGTGCAGACATGCCTGGGTCCACAGCCACGGTTTGGG',
 'GGGCCTTCCTGGGTCCCCGAGAGTGCAGACATGCCTGGGTCCACAGCCACGGTTTGGG']

The result is missing "GGGGAGAAGGGGGGCCTTCCTGGGTCCCCGAGAGTGCAGACATGCCTGGG", which occurs in my string A and matches the regular expression pattern. What is wrong here? What changes should I make to get all possible matches, including the overlapping ones?

Solution

tl;dr: When several regex matches can begin at the same string index, re.findall and the other regex methods find only one match per starting index. You have to break up the search to find them all.


The issue is that findall does not return every possible match from every index. It reports at most one match per starting index: the first one the engine finds, which with greedy quantifiers is usually the longest. Techniques for finding overlapping matches move on to the next starting index after each match, so they still miss any additional matches that begin at the same index. You need to modify your approach.
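A toy illustration of the one-match-per-start limit may help. The sketch below uses only the standard-library re module, with a capturing lookahead as a stand-in for an overlapped search; that substitution is an assumption made for portability and is not how the regex module implements overlapped=True. The short string s_toy and the simplified pattern pat_toy are invented for this example.

```python
import re

# Hypothetical toy data, not taken from the question.
s_toy = "GGGAGGGAGGG"
pat_toy = r"GGG[ACTG]+GGG"

# A capturing lookahead is tried at every position, so matches may overlap,
# but each starting position still contributes at most one match.
per_start = [m.group(1) for m in re.finditer(rf"(?=({pat_toy}))", s_toy)]

# Brute force: test every substring against the pattern with fullmatch.
full = re.compile(pat_toy)
all_spans = [
    (i, j)
    for i in range(len(s_toy))
    for j in range(i + 1, len(s_toy) + 1)
    if full.fullmatch(s_toy[i:j])
]

print(per_start)   # one match per starting index
print(all_spans)   # every (start, end) slice that matches
```

Here the per-start search reports two matches, while the brute-force scan reports three spans: the shorter match that also starts at index 0 is the one that gets lost.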

If you inspect your regex:

([G]{3,6}[ACTG]{1,33}[G]{3,6}[ACTG]{1,33}[G]{3,6}[ACTG]{1,33}[G]{3,6})

you will note that a matched sequence must start with at least 3 G's and end with at least 3 G's. The part between the two anchors, GGG[in_between_part]GGG, is as short as 9 characters (1+3+1+3+1) and as long as 117 characters (the longest possible match, 4*6 + 3*33 = 123 characters, minus the two 3-character anchors), and it may itself contain further runs of 'GGG'.

We can use that information to find all possible substrings that fit that description. Then we use your regex to keep only the candidates that really match.
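As a cross-check on the length bounds, they can be derived directly from the quantifiers in the pattern. This is only a small arithmetic sketch; the names g_run, spacer and pieces are made up for the illustration.

```python
# (min, max) repetition counts of each piece of the pattern, in order:
# [G]{3,6} [ACTG]{1,33} [G]{3,6} [ACTG]{1,33} [G]{3,6} [ACTG]{1,33} [G]{3,6}
g_run = (3, 6)
spacer = (1, 33)
pieces = [g_run, spacer, g_run, spacer, g_run, spacer, g_run]

min_total = sum(lo for lo, _ in pieces)   # shortest possible match
max_total = sum(hi for _, hi in pieces)   # longest possible match

anchor = len("GGG")                       # every match starts and ends with GGG
min_gap = min_total - 2 * anchor          # characters between the two anchors
max_gap = max_total - 2 * anchor

print(min_total, max_total)  # 15 123
print(min_gap, max_gap)      # 9 117
```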

First, find the string index of every 'GGG', since each one is a place where a matching substring could start or end:

s = "GGGGAGAAGGGGGGCCTTCCTGGGTCCCCGAGAGTGCAGACATGCCTGGGTCCACAGCCACGGTTTGGG"

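offset=0
indicies=[]
# str.find takes a start position, so no slice copy is needed per step
while (s_idx:=s.find('GGG',offset))>-1:
    indicies.append(s_idx)
    offset=s_idx+1   # advance one character so overlapping 'GGG' are found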

>>> indicies
[0, 1, 8, 9, 10, 11, 21, 47, 66]
# These are the indices of 'GGG' that might be the start or end
# of a substring of interest.
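The same list of anchor positions can also be collected with a zero-width lookahead, which is one common way to find overlapping occurrences with the standard-library re module. This is an alternative sketch, not the method used in the answer; the name ggg_starts is chosen here for illustration.

```python
import re

s = "GGGGAGAAGGGGGGCCTTCCTGGGTCCCCGAGAGTGCAGACATGCCTGGGTCCACAGCCACGGTTTGGG"

# (?=GGG) matches the empty string at every position followed by 'GGG',
# so overlapping runs such as 'GGGG' yield one hit per position.
ggg_starts = [m.start() for m in re.finditer(r"(?=GGG)", s)]

print(ggg_starts)  # [0, 1, 8, 9, 10, 11, 21, 47, 66]
```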

Now we have the starting index of every 'GGG' in your string. We can use the bisect module to pick candidate substrings and your regex to keep only those that match.

We use bisect to locate the candidate end anchors for each start anchor; both come from the same indicies list. The bisect module lets us slice out the substrings that a) start with 'GGG' (from the indicies list), b) end with 'GGG', and c) have between 9 and 117 characters between the start and end anchors. We then use re.fullmatch to ensure that the candidate substring fully matches your pattern:

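import re
import bisect

matches=[]
# An end anchor starts at x + 3 + gap, where gap is the number of characters
# between the anchors: at least 9, at most 123 - 6 = 117.
min_len=3+9
max_len=3+117
pat=re.compile(r'([G]{3,6}[ACTG]{1,33}[G]{3,6}[ACTG]{1,33}[G]{3,6}[ACTG]{1,33}[G]{3,6})')
for x in indicies:
    # bisect_left keeps an anchor at exactly x+min_len;
    # bisect_right keeps an anchor at exactly x+max_len
    min_offset=bisect.bisect_left(indicies,x+min_len)
    max_offset=bisect.bisect_right(indicies,x+max_len)
    for idx in indicies[min_offset:max_offset]:
        candidate=s[x:idx+3]
        if pat.fullmatch(candidate):
            matches.append(candidate)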

Now we can print all the matches found, with their index in s and length:

>>> for ss in matches: print((s.index(ss), len(ss)),ss)
# This is only a primitive shortcut. If you want the actual
# index, save it when 'candidate' matches the regex
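Following that suggestion, here is a minimal sketch that records the start index and length at the moment a candidate matches, instead of recovering them later with s.index. The function name find_all_spans and its parameters are hypothetical; the default gap bounds of 9 and 117 are derived from the pattern's quantifiers and would need to be recomputed for a different pattern.

```python
import bisect
import re

def find_all_spans(s, pattern, anchor="GGG", min_gap=9, max_gap=117):
    """Return (start, length, substring) for every substring of s that
    starts and ends with `anchor` and fully matches `pattern`."""
    pat = re.compile(pattern)
    # Positions of every (possibly overlapping) occurrence of the anchor.
    starts = [m.start() for m in re.finditer(f"(?={re.escape(anchor)})", s)]
    found = []
    for x in starts:
        # End anchors must begin between x+len(anchor)+min_gap and
        # x+len(anchor)+max_gap, both inclusive.
        lo = bisect.bisect_left(starts, x + len(anchor) + min_gap)
        hi = bisect.bisect_right(starts, x + len(anchor) + max_gap)
        for idx in starts[lo:hi]:
            candidate = s[x:idx + len(anchor)]
            if pat.fullmatch(candidate):
                found.append((x, len(candidate), candidate))
    return found

s = "GGGGAGAAGGGGGGCCTTCCTGGGTCCCCGAGAGTGCAGACATGCCTGGGTCCACAGCCACGGTTTGGG"
PATTERN = r"[G]{3,6}[ACTG]{1,33}[G]{3,6}[ACTG]{1,33}[G]{3,6}[ACTG]{1,33}[G]{3,6}"

spans = find_all_spans(s, PATTERN)
for start, length, text in spans:
    print((start, length), text)
```

Storing the position directly also stays correct when the same substring occurs more than once in s, a case where s.index would always report the first occurrence.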

This prints all eight unique matches, including those that share a starting index:

(0, 50) GGGGAGAAGGGGGGCCTTCCTGGGTCCCCGAGAGTGCAGACATGCCTGGG
(0, 69) GGGGAGAAGGGGGGCCTTCCTGGGTCCCCGAGAGTGCAGACATGCCTGGGTCCACAGCCACGGTTTGGG
(1, 49) GGGAGAAGGGGGGCCTTCCTGGGTCCCCGAGAGTGCAGACATGCCTGGG
(1, 68) GGGAGAAGGGGGGCCTTCCTGGGTCCCCGAGAGTGCAGACATGCCTGGGTCCACAGCCACGGTTTGGG
(8, 61) GGGGGGCCTTCCTGGGTCCCCGAGAGTGCAGACATGCCTGGGTCCACAGCCACGGTTTGGG
(9, 60) GGGGGCCTTCCTGGGTCCCCGAGAGTGCAGACATGCCTGGGTCCACAGCCACGGTTTGGG
(10, 59) GGGGCCTTCCTGGGTCCCCGAGAGTGCAGACATGCCTGGGTCCACAGCCACGGTTTGGG
(11, 58) GGGCCTTCCTGGGTCCCCGAGAGTGCAGACATGCCTGGGTCCACAGCCACGGTTTGGG


Note:

As stated in the comments, the regex module does support variable-width lookbehinds.

You could therefore be tempted to do:

import regex  # third-party module from PyPI, not the standard-library re

m1=regex.findall(r'([G]{3,6}[ACTG]{1,33}[G]{3,6}[ACTG]{1,33}[G]{3,6}[ACTG]{1,33}[G]{3,6})', s, overlapped=True)
# produces 6 unique matches
m2=regex.findall(r'(?<=([G]{3,6}[ACTG]{1,33}[G]{3,6}[ACTG]{1,33}[G]{3,6}[ACTG]{1,33}[G]{3,6}))', s, overlapped=True)
# produces 2 matches, but one is a duplicate from m1

Although this combination finds one extra string, the one you were looking for, it still does not find all 8 unique matches. The string GGGAGAAGGGGGGCCTTCCTGGGTCCCCGAGAGTGCAGACATGCCTGGG at index 1 is missed.
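To check any approach against the full set of matches, a brute-force scan over every substring can serve as a reference. This sketch uses only the standard-library re module; it is quadratic in the length of the string, so it is intended as a sanity check on short inputs rather than as a replacement for the anchor-based search. The names pat and spans are chosen here for illustration.

```python
import re

s = "GGGGAGAAGGGGGGCCTTCCTGGGTCCCCGAGAGTGCAGACATGCCTGGGTCCACAGCCACGGTTTGGG"
pat = re.compile(
    r"[G]{3,6}[ACTG]{1,33}[G]{3,6}[ACTG]{1,33}[G]{3,6}[ACTG]{1,33}[G]{3,6}"
)

MIN_MATCH = 15  # shortest possible match: 4 runs of 3 G's plus 3 single spacers

# Every (start, end) slice of s that the pattern matches in full.
spans = [
    (i, j)
    for i in range(len(s))
    for j in range(i + MIN_MATCH, len(s) + 1)
    if pat.fullmatch(s[i:j])
]

print(len(spans))
for i, j in spans:
    print((i, j - i), s[i:j])
```

On this string the scan reports eight spans, including the one starting at index 1 and ending at index 50 that the lookbehind combination misses.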
