Identify short sequences of matches between 2 strings
Problem description
I have the following code:
# The index is named i so it no longer clashes with the character variable k.
for i, (j, k) in enumerate(zip(line1_u, line2_u_rev_comp)):
    if j == k:
        Match1 += 1
    if j == 'N' or k == 'N':
        Unknown1 += 1
    if j != k:
        Different1 += 1
This takes two lines (line1_u and line2_u_rev_comp) and compares them character by character to determine whether each pair matches, contains an N (which places it in the unknown category), or differs. As well as tallying each of these, I want to identify whether 10 or more characters in a row match. How could this be done? An explanation of the code would be greatly appreciated.
Recommended answer
You should look at itertools.groupby:
from collections import defaultdict
from itertools import groupby

def class_chars(chrs):
    # chrs is a pair of aligned characters, one from each string
    if 'N' in chrs:
        return 'unknown'
    elif chrs[0] == chrs[1]:
        return 'match'
    else:
        return 'not_match'

s1 = 'aaaaaaaaaaN123bbbbbbbbbbQccc'
s2 = 'aaaaaaaaaaN456bbbbbbbbbbPccc'

n = 0                      # start index of the current group
consec_matches = []        # (start, end) index pairs of matching runs, inclusive
chars = defaultdict(int)   # total count per category

# groupby yields one group per run of consecutive pairs with the same class
for k, group in groupby(zip(s1, s2), class_chars):
    elems = len(list(group))
    chars[k] += elems
    if k == 'match':
        consec_matches.append((n, n + elems - 1))
    n += elems

print(chars)
print(consec_matches)
# keep only runs of 10 or more matching characters
print([x for x in consec_matches if x[1] - x[0] >= 9])

Output (Python 3):

defaultdict(<class 'int'>, {'match': 23, 'unknown': 1, 'not_match': 4})
[(0, 9), (14, 23), (25, 27)]
[(0, 9), (14, 23)]
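Since the question asked for an explanation: groupby walks the zipped character pairs and starts a new group each time the classification changes, so every group is one unbroken run of 'match', 'unknown', or 'not_match'. The length of each run feeds the tallies, and a running position gives its start and end indices. Below is a minimal sketch of the same idea wrapped in a function. The names classify, summarize, and the min_run parameter are hypothetical and not part of the original answer, and the sketch assumes both strings are already aligned (zip stops at the shorter one).

```python
from collections import Counter
from itertools import groupby


def classify(pair):
    """Label one aligned character pair as 'unknown', 'match', or 'not_match'."""
    a, b = pair
    if a == 'N' or b == 'N':
        return 'unknown'
    return 'match' if a == b else 'not_match'


def summarize(s1, s2, min_run=10):
    """Return (counts per category, match runs at least min_run long).

    Runs are (start, end) index pairs, inclusive.
    min_run is a hypothetical parameter introduced for this sketch.
    """
    counts = Counter()
    long_runs = []
    pos = 0  # index of the first pair in the current group
    # groupby yields one group per stretch of consecutive pairs with the same label
    for label, group in groupby(zip(s1, s2), key=classify):
        length = sum(1 for _ in group)
        counts[label] += length
        if label == 'match' and length >= min_run:
            long_runs.append((pos, pos + length - 1))
        pos += length
    return counts, long_runs


counts, long_runs = summarize('aaaaaaaaaaN123bbbbbbbbbbQccc',
                              'aaaaaaaaaaN456bbbbbbbbbbPccc')
print(dict(counts))  # {'match': 23, 'unknown': 1, 'not_match': 4}
print(long_runs)     # [(0, 9), (14, 23)]
```

In the question's setting this would be called as summarize(line1_u, line2_u_rev_comp). Note that, as in the answer above, a pair containing N is counted only as unknown, whereas the original loop could count the same pair in more than one tally.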