String regex two mismatches Python
Problem description
How can I extend the code below to allow me to explore all instances where I have 2 mismatches or less between my substring and the parent string?
Substring: SSQP
String-to-match-to: SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQ
Here is an example where only one possible mismatch is incorporated:
>>> import re
>>> s = 'SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQ'
>>> re.findall(r'(?=(SSQP|[A-Z]SQP|S[A-Z]QP|SS[A-Z]P|SSQ[A-Z]))', s)
['SSQQ', 'SSQP', 'SSQP', 'SSQP', 'SSQP', 'SSQP', 'SSQP', 'SSQP', 'SSQP']
Obviously, incorporating the possibility of two mismatches in the code above would require a lot of brute-force typing of all the possible combinations.
How can I extend this code (or refactor this code) to explore the possibility of two mismatches?
Furthermore, I want to modify my output so that I get the numeric index (not SSQQ or SSQP) of the exact position where the substring matched the string.
You don't have to use re here; you can use the itertools module instead and save a lot of memory.
You can first extract all substrings of length 4 (the length of your substring), compare each one with your substring, and keep only those that differ from it in at most 2 positions:
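# Python 2 code: izip and xrange do not exist in Python 3.
from itertools import izip, islice, tee

def sub_findre(s, substring, diffnumber):
    sublen = len(substring)
    # Pair each character of substring with the window of s starting at i.
    # Only full-length windows are generated, so a short window at the
    # end of s is never compared on fewer than sublen characters.
    zip_gen = (izip(substring, islice(s, i, i + sublen))
               for i in xrange(len(s) - sublen + 1))
    for z in zip_gen:
        # tee() gives two copies: one to count matches, one to rebuild the window.
        l, z = tee(z)
        if sum(1 for i, j in l if i == j) >= sublen - diffnumber:
            new = izip(*z)
            next(new)                 # skip the characters of substring
            yield ''.join(next(new))  # the matching window taken from s

Demo:
s='SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQ'
substring='SSQP'
print list(sub_findre(s,substring,2))
['SSPQ', 'SPQQ', 'QQQP', 'SSSS', 'SSSQ', 'SSQQ', 'SQQQ', 'SSQP', 'PSQS', 'SSQP', 'SSQP', 'SQPP', 'SSSS', 'SSSQ', 'SSQP', 'PSQS', 'SSQP', 'SSSS', 'SSSQ', 'SSQP', 'PSQS', 'SSQP', 'SSSS', 'SSSQ', 'SSQP']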
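The code above targets Python 2. As a rough sketch of the same windowed comparison on Python 3, the built-in zip and plain slicing are enough. The names hamming and fuzzy_windows are illustrative and not part of the original answer, and this version only compares full-length windows.

```python
def hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def fuzzy_windows(s, substring, max_mismatches):
    """Yield every full-length window of s that differs from substring
    in at most max_mismatches positions (substitutions only)."""
    n = len(substring)
    for i in range(len(s) - n + 1):
        window = s[i:i + n]
        if hamming(window, substring) <= max_mismatches:
            yield window


s = 'SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQ'
print(list(fuzzy_windows(s, 'SSQP', 2)))
```

Slicing builds a short temporary string per window, which trades a little memory for readability compared with the islice/tee approach.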
If you want to return the indices instead, include the index in izip; you can use itertools.repeat() to repeat the index once for each character of substring:
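# Python 2 code: izip and xrange do not exist in Python 3.
from itertools import izip, islice, tee, repeat

def sub_findre(s, substring, diffnumber):
    sublen = len(substring)
    # Each triple is (substring char, window char, start index of the window).
    # Only full-length windows are generated.
    zip_gen = (izip(substring, islice(s, i, i + sublen), repeat(i, sublen))
               for i in xrange(len(s) - sublen + 1))
    for z in zip_gen:
        # tee() gives two copies: one to count matches, one to read the index.
        l, z = tee(z)
        if sum(1 for i, j, _ in l if i == j) >= sublen - diffnumber:
            new = izip(*z)
            next(new)           # skip the characters of substring
            next(new)           # skip the characters of the window
            yield next(new)[0]  # the start index (repeated sublen times)

Demo:
print list(sub_findre(s,substring,2))
[0, 1, 4, 8, 9, 10, 11, 15, 20, 23, 27, 28, 32, 33, 34, 39, 42, 46, 47, 48, 53, 56, 60, 61, 62]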
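If you would rather stay with re, the alternation from the question does not have to be typed by hand. The Python 3 sketch below generates it with itertools.combinations and reads the positions from re.finditer. The helper names mismatch_pattern and fuzzy_indices are hypothetical, and the sketch assumes substitutions only (no insertions or deletions).

```python
import re
from itertools import combinations


def mismatch_pattern(substring, max_mismatches):
    """Build a lookahead regex that matches substring with up to
    max_mismatches substituted characters."""
    n = len(substring)
    k = min(max_mismatches, n)
    alternatives = []
    # Choosing exactly k wildcard positions also covers fewer than k
    # mismatches, because '.' still matches the original character.
    for positions in combinations(range(n), k):
        alternatives.append(''.join(
            '.' if i in positions else re.escape(ch)
            for i, ch in enumerate(substring)))
    # The lookahead keeps every match zero-width, so overlapping hits are found.
    return re.compile('(?=(%s))' % '|'.join(alternatives), re.DOTALL)


def fuzzy_indices(s, substring, max_mismatches):
    """Return the start index of every fuzzy match of substring in s."""
    pattern = mismatch_pattern(substring, max_mismatches)
    return [m.start() for m in pattern.finditer(s)]


s = 'SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQ'
print(fuzzy_indices(s, 'SSQP', 2))
```

The pattern has one alternative per combination of wildcard positions (6 for a 4-character substring with 2 mismatches), so it grows quickly for longer substrings; the windowed comparison is likely the better fit there.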