String regex two mismatches Python


Question


How can I extend the code below to find all instances where there are two or fewer mismatches between my substring and the parent string?

Substring: SSQP

String-to-match-to: SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQ

Here is an example where only one possible mismatch is incorporated:

>>> import re
>>> s = 'SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQ'
>>> re.findall(r'(?=(SSQP|[A-Z]SQP|S[A-Z]QP|SS[A-Z]P|SSQ[A-Z]))', s)
['SSQQ', 'SSQP', 'SSQP', 'SSQP', 'SSQP', 'SSQP', 'SSQP', 'SSQP', 'SSQP']

Obviously, incorporating the possibility of two mismatches in the code above would require a lot of brute-force typing of all the possible combinations.

How can I extend this code (or refactor this code) to explore the possibility of two mismatches?

Furthermore, I want to modify my output so that, instead of the matched text (SSQQ or SSQP), I get the numeric index of the exact position where the substring matched the string.

Solution

You don't have to use re here; you can use the itertools module instead and save a lot of memory.

You can first extract every substring of s with the same length as your substring (4 here), compare each one with your substring character by character, and keep those that differ from it in at most 2 positions:

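# Written for Python 3, where zip and range are lazy (the original used izip and xrange).
from itertools import islice, tee

def sub_findre(s, substring, diffnumber):
    sublen = len(substring)
    # Pair the characters of substring with the window of s that starts at each index i.
    # Windows near the end of s are shorter than substring, so a truncated window can be yielded.
    zip_gen = (zip(substring, islice(s, i, i + sublen)) for i in range(len(s)))
    for z in zip_gen:
        # Duplicate the iterator: one copy for counting, one for rebuilding the window.
        l, z = tee(z)
        # Keep the window if enough characters agree, i.e. at most diffnumber mismatches.
        if sum(1 for i, j in l if i == j) >= sublen - diffnumber:
            # Split the pairs back into (substring characters, window characters).
            _, window = zip(*z)
            yield ''.join(window)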

Demo:

s = 'SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQ'
substring = 'SSQP'
print(list(sub_findre(s, substring, 2)))

['SSPQ', 'SPQQ', 'QQQP', 'SSSS', 'SSSQ', 'SSQQ', 'SQQQ', 'SSQP', 'PSQS', 'SSQP', 'SSQP', 'SQPP', 'SSSS', 'SSSQ', 'SSQP', 'PSQS', 'SSQP', 'SSSS', 'SSSQ', 'SSQP', 'PSQS', 'SSQP', 'SSSS', 'SSSQ', 'SSQP', 'PSQ']
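The same windowing idea can also be written with plain slicing. What follows is only a sketch under stated assumptions, not the answer's code: hamming_windows is a hypothetical helper name, it targets Python 3, and unlike sub_findre it only looks at full-length windows, so the truncated 'PSQ' at the end of the output above would not be reported.

```python
def hamming_windows(s, substring, max_mismatches):
    """Yield each full-length window of s within max_mismatches of substring.

    Hypothetical helper for illustration; not part of the original answer.
    """
    sublen = len(substring)
    # Stop early enough that every window has exactly sublen characters.
    for i in range(len(s) - sublen + 1):
        window = s[i:i + sublen]
        # Hamming distance: number of positions where the characters differ.
        mismatches = sum(a != b for a, b in zip(substring, window))
        if mismatches <= max_mismatches:
            yield window

s = 'SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQ'
print(list(hamming_windows(s, 'SSQP', 1)))
```

With max_mismatches set to 1 this should reproduce the list returned by the regex in the question, which is a quick way to check that both approaches agree.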

If you want to return the indices instead, add the start index to each zipped tuple; itertools.repeat() can repeat the index once for every character of the substring:

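# Written for Python 3, where zip and range are lazy (the original used izip and xrange).
from itertools import islice, tee, repeat

def sub_findre(s, substring, diffnumber):
    sublen = len(substring)
    # Each tuple is (substring character, window character, start index of the window).
    zip_gen = (zip(substring, islice(s, i, i + sublen), repeat(i, sublen)) for i in range(len(s)))
    for z in zip_gen:
        # Duplicate the iterator: one copy for counting, one for reading the index.
        l, z = tee(z)
        # Keep the window if enough characters agree, i.e. at most diffnumber mismatches.
        if sum(1 for i, j, _ in l if i == j) >= sublen - diffnumber:
            # The third column holds the start index, repeated for every character.
            _, _, indices = zip(*z)
            yield indices[0]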

Demo:

print(list(sub_findre(s, substring, 2)))
[0, 1, 4, 8, 9, 10, 11, 15, 20, 23, 27, 28, 32, 33, 34, 39, 42, 46, 47, 48, 53, 56, 60, 61, 62, 67]
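
If you would rather stay with re, the alternation from the question does not have to be typed by hand. The sketch below is an alternative to the itertools answer, not a restatement of it: mismatch_pattern is a hypothetical helper name, the code targets Python 3, and it assumes the text contains only uppercase letters, which is why [A-Z] is used as the wildcard. It places a wildcard at every combination of positions and reads the index from each match object.

```python
import re
from itertools import combinations

def mismatch_pattern(substring, max_mismatches, wildcard='[A-Z]'):
    """Compile a lookahead regex allowing up to max_mismatches wildcard positions.

    Hypothetical helper for illustration; assumes uppercase-only input text.
    """
    # Never ask for more wildcard positions than there are characters.
    k = min(max_mismatches, len(substring))
    alternatives = []
    for positions in combinations(range(len(substring)), k):
        # The wildcard also matches the original character, so choosing
        # exactly k positions covers every case with k or fewer mismatches.
        chars = [wildcard if i in positions else re.escape(c)
                 for i, c in enumerate(substring)]
        alternatives.append(''.join(chars))
    # The lookahead keeps matches zero-width so overlapping hits are found.
    return re.compile('(?=(' + '|'.join(alternatives) + '))')

s = 'SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQ'
pattern = mismatch_pattern('SSQP', 2)
# m.start() is the index where the window begins; m.group(1) is its text.
print([m.start() for m in pattern.finditer(s)])
```

Because the pattern always consumes four characters, this version cannot report the truncated window at index 67 that appears in the output above. The number of alternatives grows with the number of position combinations, so for long substrings the windowed comparison is likely the more practical choice.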
