Ambiguous substring with mismatches

Problem description

I'm trying to use regular expressions to find a substring in a string of DNA. The substring contains ambiguous bases, as in ATCGR, where R can be A or G. The script must also allow up to x mismatches. This is my code:

import regex

s = 'ACTGCTGAGTCGT'    
regex.findall(r"T[AG]T"+'{e<=1}', s, overlapped=True)

So, with one mismatch I would expect 3 substrings: AC**TGC**TGAGTCGT, ACTGC**TGA**GTCGT and ACTGCTGAGT**CGT**. The expected result should look like this:

['TGC', 'TGA', 'AGT', 'CGT']

But the output is

['TGC', 'TGA']

Even using re.findall, the code doesn't recognize the last substring. On the other hand, if the code is set to allow 2 mismatches with {e<=2}, the output is

['TGC', 'TGA']

Is there another way to get all the substrings?

Recommended answer

If I understand correctly, you are looking for all three-letter substrings that match the pattern T[GA]T while allowing at most one error. I think the only kind of error you want is a character substitution, since you never mentioned 2-letter results.

To obtain the expected result, change {e<=1} to {s<=1} (or {s<2}) and apply it to the whole pattern, not only the last letter, by enclosing the pattern in a group (capturing or non-capturing, as you prefer). Otherwise the constraint {s<=1} applies only to the last letter:

regex.findall(r'(T[AG]T){s<=1}', s, overlapped=True)
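As a cross-check that does not depend on the regex module, the same substitution-only search can be sketched as a plain sliding window. This is only an illustration, not part of the original answer: the helper `find_with_mismatches` and the `IUPAC` table are hypothetical names, the table covers only a subset of the IUPAC ambiguity codes, and the query is written as TRT (R standing for A or G) instead of T[AG]T.

```python
# Hypothetical helper, not part of the regex module: a sliding-window search
# that counts substitutions only (no insertions or deletions).

# Subset of the IUPAC nucleotide ambiguity codes (assumption: extend as needed).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "N": "ACGT",
}

def find_with_mismatches(pattern, seq, max_mismatches=1):
    """Return every window of len(pattern) in seq that differs from
    pattern by at most max_mismatches substitutions."""
    hits = []
    k = len(pattern)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        # A position mismatches when the base is not allowed by the pattern code.
        mismatches = sum(base not in IUPAC[code]
                         for code, base in zip(pattern, window))
        if mismatches <= max_mismatches:
            hits.append(window)
    return hits

s = "ACTGCTGAGTCGT"
print(find_with_mismatches("TRT", s, max_mismatches=1))
# ['TGC', 'TGA', 'AGT', 'CGT']
```

Because every window is tested independently, overlapping hits are reported naturally, and the result can be compared against the list expected in the question.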
