Ambiguous substring with mismatches
Question
I'm trying to use regular expressions to find a substring in a string of DNA. This substring has ambiguous bases, like `ATCGR`, where `R` could be `A` or `G`. Also, the script must allow `x` number of mismatches. This is my code:
import regex
s = 'ACTGCTGAGTCGT'
regex.findall(r"T[AG]T"+'{e<=1}', s, overlapped=True)
So, with one mismatch I would expect 3 substrings: AC**TGC**TGAGTCGT, ACTGC**TGA**GTCGT and ACTGCTGAGT**CGT**. The expected result should look like this:
['TGC', 'TGA', 'AGT', 'CGT']
But the output is
['TGC', 'TGA']
Even using re.findall, the code doesn't recognize the last substring. On the other hand, if the code is set to allow 2 mismatches with `{e<=2}`, the output is
['TGC', 'TGA']
Is there another way to get all the substrings?
Recommended answer
If I understand correctly, you are looking for all three-letter substrings that match the pattern `T[GA]T`, allowing at most one error. I think the error you have in mind is only a character substitution, since you never mentioned two-letter results.
To obtain the expected result, you have to change `{e<=1}` to `{s<=1}` (or `{s<2}`) and apply it to the whole pattern, not only the last letter, by enclosing the pattern in a group (capturing or non-capturing, as you prefer). Otherwise the fuzzy constraint `{s<=1}` applies only to the last letter:
regex.findall(r'(T[AG]T){s<=1}', s, overlapped=True)
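As a cross-check on what "at most one substitution" should return, here is a minimal pure-Python sketch that slides a window over the sequence and counts mismatches directly. It is an alternative to the `regex` call above, not part of that module: the helper name `find_with_mismatches` and the `IUPAC` table are hypothetical, and the table lists only a few ambiguity codes for illustration.

```python
# Hypothetical lookup table: a small subset of IUPAC nucleotide codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",    # purine: A or G
    "Y": "CT",    # pyrimidine: C or T
    "N": "ACGT",  # any base
}


def find_with_mismatches(pattern, seq, max_mismatches):
    """Return every window of seq (overlaps included) that differs from
    pattern in at most max_mismatches positions. Only substitutions are
    considered, so every hit has the same length as pattern."""
    k = len(pattern)
    hits = []
    for start in range(len(seq) - k + 1):
        window = seq[start:start + k]
        # A position is a mismatch when the base is not allowed by the code.
        mismatches = sum(
            base not in IUPAC[code] for code, base in zip(pattern, window)
        )
        if mismatches <= max_mismatches:
            hits.append(window)
    return hits


s = "ACTGCTGAGTCGT"
print(find_with_mismatches("TRT", s, 1))  # ['TGC', 'TGA', 'AGT', 'CGT']
```

Because insertions and deletions are never considered, this corresponds to the substitution-only `{s<=1}` constraint rather than the general `{e<=1}`.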