如何在python中合并重叠的字符串? [英] How can I merge overlapping strings in python?

查看:293
本文介绍了如何在python中合并重叠的字符串?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一些字符串,

['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']

这些字符串彼此部分重叠.如果您手动重叠它们,则会得到:

These strings partially overlap each other. If you manually overlapped them you would get:

SGALWDVPSPV

我想要一种方法,从重叠的字符串列表转到python中的最终压缩字符串.我觉得这一定是一个已经解决了的问题,并且正在努力避免重新发明轮子.我现在可以想象的方法要么是蛮力的,要么是由于使用biopython和序列比对器而变得比我想要的复杂.我有一些简单的短字符串,只是想以一种简单的方式正确地合并它们.

I want a way to go from the list of overlapping strings to the final compressed string in python. I feel like this must be a problem that someone has solved already and am trying to avoid reinventing the wheel. The methods I can imagine now are either brute force or involve getting more complicated by using biopython and sequence aligners than I would like. I have some simple short strings and just want to properly merge them in a simple way.

有人在使用python做到这一点的好方法上有任何建议吗?谢谢!

Does anyone have any advice on a nice way to do this in python? Thanks!

推荐答案

我提出的解决方案具有更具挑战性的测试列表:

My proposed solution with a more challenging test list:

#strFrag = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']
strFrag = ['ALWDVPS', 'SGALWDV', 'LWDVPSP', 'WDVPSPV', 'GALWDVP', 'LWDVPSP', 'ALWDVPS']

for repeat in range(0, len(strFrag)-1):
    bestMatch = [2, '', ''] #overlap score (minimum value 3), otherStr index, assembled str portion
    for otherStr in strFrag[1:]:
        for x in range(0,len(otherStr)):
            if otherStr[x:] == strFrag[0][:len(otherStr[x:])]:
                if len(otherStr)-x > bestMatch[0]:
                    bestMatch = [len(otherStr)-x, strFrag.index(otherStr), otherStr[:x]+strFrag[0]]
            if otherStr[:-x] == strFrag[0][-len(otherStr[x:]):]:
                if x > bestMatch[0]:
                    bestMatch = [x, strFrag.index(otherStr), strFrag[0]+otherStr[-x:]]
    if bestMatch[0] > 2:
        strFrag[0] = bestMatch[2]
        strFrag = strFrag[:bestMatch[1]]+strFrag[bestMatch[1]+1:]

print(strFrag)       
print(strFrag[0])

基本上,代码将每个字符串/片段与列表中的第一个字符串/片段进行比较,并找到最佳匹配项(最重叠).它逐步合并列表,合并最佳匹配项并删除单个字符串.代码假定字符串/片段之间没有不可填充的间隙(否则答案可能不会导致最长的组装.可以通过随机化起始字符串/片段来解决).还假定不存在反向补码(重叠群装配的不良假设),这将导致无意义/不可匹配的字符串/片段.我提供了一种方法来限制最低匹配要求(更改bestMatch [0]值)以防止错误匹配.最后的假设是所有匹配都是精确的.为了在组装序列时允许不匹配的灵活性,使问题变得更加复杂.我可以根据要求提供组装不匹配的解决方案.

Basically the code compares every string/fragment to the first in list and finds the best match (most overlap). It consolidates the list progressively, merging the best matches and removing the individual strings. Code assumes that there are no unfillable gaps between strings/fragments (Otherwise answer may not result in longest possible assembly. Can be solved by randomizing the starting string/fragment). Also assumes that the reverse complement is not present (poor assumption with contig assembly), which would result in nonsense/unmatchable strings/fragments. I've included a way to restrict the minimum match requirements (changing bestMatch[0] value) to prevent false matches. Last assumption is that all matches are exact. To enable flexibility in permitting mismatches when assembling the sequence makes the problem considerably more complex. I can provide a solution for assembling with mismatches upon request.

这篇关于如何在python中合并重叠的字符串?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆