重叠正则表达式匹配 [英] Overlapping regex matches

查看:49
本文介绍了重叠正则表达式匹配的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试创建以下正则表达式:返回 AUG 和 (UAGUGAUAA 之间的字符串) 来自以下 RNA 字符串:AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG,以便找到所有匹配项,包括重叠的匹配项.

I'm trying to create the following regular expression: return a string between AUG and (UAG or UGA or UAA) from a following RNA string: AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG, so that all matches would be found, including the overlapping ones.

我尝试了几个正则表达式,结果是这样的:

I've tried several regexes, ending up with something like that:

matches = re.findall('(?=AUG)(\w+)(?=UAG|UGA|UAA)',"AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG")

你能告诉我正则表达式中的错误吗?

Could you show me the errors in my regex pattern?

推荐答案

用一个正则表达式做到这一点实际上非常困难,因为大多数用途特别想要重叠匹配.但是,您可以通过一些简单的迭代来做到这一点:

Doing this with one regular expression is actually pretty difficult, as most uses specifically don't want overlapping matches. You could, however, do this with some simple iteration:

regex = re.compile('(?=AUG)(\w+)(?=UAG|UGA|UAA)');
RNA = 'AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG'
matches = []
tmp = RNA
while (match = regex.search(tmp)):
    matches.append(match)
    tmp = tmp[match.start()-2:]  #Back up two to get the UG portion.  Shouldn't matter, but safer.

for m in matches:
    print m.group(0)

不过,这有一些问题.在 AUGUAGUGAUAA 的情况下,您期望回报是多少?是否有两个字符串要返回?还是只有一个?现在,您的正则表达式甚至无法捕获 UAG,因为它会继续匹配 UAGUGA 并在 UAA 处被截断.为了解决这个问题,您可能希望使用 ? 运算符使您的运算符变得懒惰 - 这种方法将无法捕获更长的子字符串.

Though, this has some problems. What would you expect the return to be in the case of AUGUAGUGAUAA? Are there two strings to be returned? Or just one? Right now, your regular expression isn't even capable of capturing UAG, as it continues on through to match UAGUGA and get cut off at UAA. To combat this, then, you might wish to use the ? operator to make your operator lazy - an approach that then will be unable to capture the longer substring.

也许对字符串进行两次迭代就是答案,但是如果您的 RNA 序列包含 AUGAUGUAGUGAUAA 呢?那里的正确行为是什么?

Maybe iteration over the string twice is the answer, but then what if your RNA Sequence contains AUGAUGUAGUGAUAA? What's the correct behaviour there?

通过迭代字符串及其子字符串,我可能更喜欢无正则表达式的方法:

I might favor a regular expression free approach, by iterating over the string and its substrings:

RNA = 'AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG'
candidates = []
start = 0

while (RNA.find('AUG', start) > -1):
    start = RNA.find('AUG', start) #Confound python and its lack of assignment returns
    candidates.append(RNA[start+3:])
    start += 1

matches = []

for candidate in candidates:
    for terminator in ['UAG', 'UGA', 'UAA']:
        end = 1;
        while(candidate.find(terminator, end) > -1):
            end = candidate.find(terminator, end)
            matches.append(candidate[:end])
            end += 1

for match in matches:
    print match

这样,您肯定会获得所有匹配项,无论如何.

This way, you're sure to get all matches, no matter what.

如果您需要跟踪每个匹配的位置,您可以修改您的候选数据结构以使用保持起始位置的元组:

If you need to keep track of the position of each match, you can modify your candidates data structure to use tuples which maintain the starting position:

RNA = 'AGCCAUGUAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAGUAGCAUCUCAG'
candidates = []
start = 0

while (RNA.find('AUG', start) > -1):
    start = RNA.find('AUG', start) #Confound python and its lack of assignment returns
    candidates.append((RNA[start+3:], start+3))
    start += 1

matches = []

for candidate in candidates:
    for terminator in ['UAG', 'UGA', 'UAA']:
        end = 1;
        while(candidate[0].find(terminator, end) > -1):
            end = candidate[0].find(terminator, end)
            matches.append((candidate[1], candidate[1] + end, candidate[0][:end]))
            end += 1

for match in matches:
    print "%d - %d: %s" % match

打印:

7 - 49: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAU
7 - 85: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
7 - 31: UAGCUAACUCAGGUUACAUGGGGA
7 - 72: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
7 - 76: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
7 - 11: UAGC
7 - 66: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
27 - 49: GGGAUGACCCCGCGACUUGGAU
27 - 85: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
27 - 31: GGGA
27 - 72: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
27 - 76: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
27 - 66: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
33 - 49: ACCCCGCGACUUGGAU
33 - 85: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
33 - 72: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
33 - 76: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
33 - 66: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
78 - 85: AUCCGAG

该死,实际上还有三行,您甚至可以根据它们在 RNA 序列中的位置对匹配项进行排序:

Hell, with literally three more lines, you can even sort the matches according to where they fall in the RNA sequence:

from operator import itemgetter
matches.sort(key=itemgetter(1))
matches.sort(key=itemgetter(0)) 

放置在最终印刷品之前的那个让你感到:

That placed before the final print nets you:

007 - 011: UAGC
007 - 031: UAGCUAACUCAGGUUACAUGGGGA
007 - 049: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAU
007 - 066: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
007 - 072: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
007 - 076: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
007 - 085: UAGCUAACUCAGGUUACAUGGGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
027 - 031: GGGA
027 - 049: GGGAUGACCCCGCGACUUGGAU
027 - 066: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
027 - 072: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
027 - 076: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
027 - 085: GGGAUGACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
033 - 049: ACCCCGCGACUUGGAU
033 - 066: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAA
033 - 072: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCC
033 - 076: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAA
033 - 085: ACCCCGCGACUUGGAUUAGAGUCUCUUUUGGAAUAAGCCUGAAUGAUCCGAG
078 - 085: AUCCGAG

这篇关于重叠正则表达式匹配的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆