A more complex version of "How can I tell if a string repeats itself in Python?"


Problem description

I was reading this post and I wonder whether someone can find a way to catch repetitive motifs in a more complex string.

For example, find all the repetitive motifs in

string = 'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'

Here the repetitive motifs are set apart by spaces: 'AAAC ACGTACGT AATTCC GTGTGT CCCC TATACGTATACG TTT' (the repeated runs are ACGTACGT, GTGTGT and TATACGTATACG).

So, the output should be something like this:

output = {'ACGT': {'repeat': 2,
                   'region': (5,13)},
          'GT': {'repeat': 3,
                 'region': (19,24)},
          'TATACG': {'repeat': 2,
                     'region': (29,40)}}

This example comes from a typical biological phenomenon called a microsatellite, which occurs in DNA.

UPDATE 1: Asterisks were removed from the string variable. It was a mistake.

UPDATE 2: Single-character motifs don't count. For example, in ACGUGAAAGUC the 'A' motif is not taken into account.

Recommended answer

You can use a recursive function like the following:

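Note: the result list is passed along through the recursive calls and modified in place (passing a mutable object to a function affects the caller). It defaults to None rather than [], because a mutable default argument would be shared between separate calls to finder.

import re

def finder(st, past_ind=0, result=None):
    # Create a fresh list per top-level call instead of a shared default.
    if result is None:
        result = []
    # Leftmost run of one substring repeated at least twice in a row.
    m = re.search(r'(.+)\1+', st)
    if m:
        i, j = m.span()
        sub = st[i:j]
        # Reduce the run to its shortest repeating unit.
        ind = (sub + sub).find(sub, 1)
        sub = sub[:ind]
        if len(sub) > 1:  # single-character motifs do not count
            # Report the 0-based span shifted by one, offset by the text already consumed.
            result.append([sub, (i + past_ind + 1, j + past_ind + 1)])
        past_ind += j
        # Continue with the rest of the string after this run.
        return finder(st[j:], past_ind, result)
    return result


s = 'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'
print(finder(s))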

Result:

[['ACGT', (5, 13)], ['GT', (19, 25)], ['TATACG', (29, 41)]]
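
The same scan can also be written without recursion. The following is only a sketch and not part of the original answer: the names find_motifs and min_len are hypothetical, and it assumes that a list of dicts (rather than a dict keyed by motif) is acceptable so that a motif occurring in two places is not overwritten. It keeps the region convention of finder and adds the repeat count the question asked for.

```python
import re

def find_motifs(seq, min_len=2):
    """Return a list of dicts describing each tandem repeat in seq."""
    found = []
    # finditer scans left to right and yields non-overlapping matches,
    # which replaces the recursion on the remaining suffix.
    for m in re.finditer(r'(.+)\1+', seq):
        run = m.group(0)
        # Shortest repeating unit of the run (doubling trick).
        unit = run[:(run + run).find(run, 1)]
        if len(unit) >= min_len:
            start, end = m.span()
            found.append({
                'motif': unit,
                'repeat': len(run) // len(unit),
                # Same convention as finder: 0-based span shifted by one.
                'region': (start + 1, end + 1),
            })
    return found

s = 'AAACACGTACGTAATTCCGTGTGTCCCCTATACGTATACGTTT'
for hit in find_motifs(s):
    print(hit)
```

Because the repeat count is derived from the lengths of the run and its unit, no second pass over the string is needed.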

Answer to the previous version of the question, for the following string:

s = 'AAAC**ACGTACGTA**ATTCC**GTGTGT**CCCC**TATACGTATACG**TTT'

You can combine both answers from the question mentioned above with a few extra steps:

First, split the string on ** and then use the regex r'(.+)\1+' to build a new list containing the repeated runs:

So the result will be:

>>> import re
>>> new=[re.search(r'(.+)\1+',i).group(0) for i in s.split('**')]
>>> new
['AAA', 'ACGTACGT', 'TT', 'GTGTGT', 'CCCC', 'TATACGTATACG', 'TTT']

Note that 'ACGTACGT' has lost the A at the end, because the regex only matches whole copies of the repeated unit.
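
To see why the trailing A is dropped, it may help to look at the match object for that one chunk. This is a small illustrative sketch; the variable name chunk is an assumption made for the example.

```python
import re

chunk = 'ACGTACGTA'
m = re.search(r'(.+)\1+', chunk)

print(m.group(0))        # the whole matched run
print(m.group(1))        # the captured unit that is repeated
print(chunk[m.end():])   # leftover text that is not part of a full copy
```

The backreference \1 can only consume complete copies of the captured group, so a partial copy at the end of the chunk is left outside the match.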

Then you can use the principal_period function to get the repeated substrings:

def principal_period(s):
    i = (s+s).find(s, 1, -1)
    return None if i == -1 else s[:i]
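
The doubling trick in principal_period can be checked on a few inputs. The sketch below repeats the function so that it runs on its own; the sample strings are chosen only for illustration.

```python
def principal_period(s):
    # Look for s inside s+s, skipping the trivial hits at offset 0 and len(s).
    i = (s + s).find(s, 1, -1)
    return None if i == -1 else s[:i]

for text in ['GTGTGT', 'TATACGTATACG', 'CCCC', 'ACGTA']:
    print(text, '->', principal_period(text))
```

A string that is not made of whole repeats, such as 'ACGTA', gives None, while 'CCCC' gives the single character 'C', which is why the length check len(p) > 1 is still needed afterwards.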

>>> l, sub = [], []
>>> for i in new:
...    p=principal_period(i)
...    if p is not None and len(p)>1:
...        l.append(p)
...        sub.append(i)
... 

So you will have the repeated units in l and the full runs in sub:

>>> l
['ACGT', 'GT', 'TATACG']
>>> sub
['ACGTACGT', 'GTGTGT', 'TATACGTATACG']

Then you need the region, which you can get with the span method:

>>> regons = []
>>> for t in sub:
...    regons.append(re.search(t,s).span())

>>> regons
[(6, 14), (24, 30), (38, 50)]
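
One caveat: re.search(t, s) reports only the first place where the run occurs, so a run that appears twice in s would get the same region both times. A possible alternative, sketched here under the assumption that the chunks are always separated by exactly '**', is to track each chunk's starting offset while splitting. The names regions and offset are hypothetical.

```python
import re

s = 'AAAC**ACGTACGTA**ATTCC**GTGTGT**CCCC**TATACGTATACG**TTT'

regions = []
offset = 0  # index in s where the current chunk starts
for chunk in s.split('**'):
    m = re.search(r'(.+)\1+', chunk)
    if m:
        run = m.group(0)
        # Shortest repeating unit of the matched run.
        unit = run[:(run + run).find(run, 1)]
        if len(unit) > 1:
            # Shift the span inside the chunk to a span inside s.
            regions.append((unit, (offset + m.start(), offset + m.end())))
    offset += len(chunk) + len('**')  # skip the chunk and its separator

print(regions)
```

For this input the spans agree with the ones obtained from re.search above.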

Finally, you can zip the three lists sub, l and regons and use a dict comprehension to build the expected result:

>>> z=zip(sub,l,regons)
>>> out={i :{'repeat':i.count(j),'region':reg} for i,j,reg in z}
>>> out
{'TATACGTATACG': {'region': (38, 50), 'repeat': 2}, 'ACGTACGT': {'region': (6, 14), 'repeat': 2}, 'GTGTGT': {'region': (24, 30), 'repeat': 3}}

The main code:

>>> import re
>>> s = 'AAAC**ACGTACGTA**ATTCC**GTGTGT**CCCC**TATACGTATACG**TTT'
>>> sub=[]
>>> l=[]
>>> regons=[]
>>> new=[re.search(r'(.+)\1+',i).group(0) for i in s.split('**')]
>>> for i in new:
...    p=principal_period(i)
...    if p is not None and len(p)>1:
...        l.append(p)
...        sub.append(i)
... 

>>> for t in sub:
...    regons.append(re.search(t,s).span())
... 
>>> z=zip(sub,l,regons)
>>> out={i :{'repeat':i.count(j),'region':reg} for i,j,reg in z}
>>> out
{'TATACGTATACG': {'region': (38, 50), 'repeat': 2}, 'ACGTACGT': {'region': (6, 14), 'repeat': 2}, 'GTGTGT': {'region': (24, 30), 'repeat': 3}}
