Python: Multiple consensus sequences


Problem description

Starting from a list of DNA sequences, I need to return all possible consensus sequences (a consensus sequence takes, at each position, the nucleotide with the highest frequency). If several nucleotides share the highest frequency at some position, I need every combination of those tied nucleotides. I also need to return the profile matrix (a matrix with the count of each nucleotide at each position).

This is my code so far (but it returns only one consensus sequence):

seqList = ['TTCAAGCT','TGGCAACT','TTGGATCT','TAGCAACC','TTGGAACT','ATGCCATT','ATGGCACT']
n = len(seqList[0])

# Profile matrix: profile[nt][i] is the number of sequences with nucleotide nt at position i
profile = { 'T':[0]*n,'G':[0]*n ,'C':[0]*n,'A':[0]*n }

for seq in seqList:
    for i, char in enumerate(seq):
        profile[char][i] += 1

# Build one consensus: the strict '>' keeps only the first maximum in "ACGT" order, so ties are lost
consensus = ""
for i in range(n):
    max_count = 0
    max_nt = 'x'
    for nt in "ACGT":
        if profile[nt][i] > max_count:
            max_count = profile[nt][i]
            max_nt = nt
    consensus += max_nt
print(consensus)

# Print the profile matrix, one row per nucleotide
for key, value in profile.items():
    print(key, ':', " ".join(str(x) for x in value))

TTGCAACT
C : 0 0 1 3 2 0 6 1
A : 2 1 0 1 5 5 0 0
G : 0 1 6 3 0 1 0 0
T : 5 5 0 0 0 1 1 6

(As you can see, at position four C and G share the highest count, so I should obtain two consensus sequences.)

Is it possible to modify this code to obtain all possible sequences, or could you explain the logic (pseudocode) for getting the right result?

Thank you very much!

Recommended answer

I'm sure there are better ways, but this is a simple one:

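# Start with one empty partial sequence; n and profile come from the question's code
bestseqs = [[]]
for i in range(n):
    # Counts of each nucleotide at position i
    d = {N:profile[N][i] for N in ['T','G','C','A']}
    m = max(d.values())
    # All nucleotides tied for the highest count at this position
    l = [N for N in ['T','G','C','A'] if d[N] == m]
    # Extend every partial sequence with every tied nucleotide
    bestseqs = [ s+[N] for N in l for s in bestseqs ]

for s in bestseqs:
    print(''.join(s))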

# output:
TTGGAACT
TTGCAACT
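
The same idea can also be written with itertools.product: collect the tied nucleotides for each position, then take the Cartesian product of those per-position lists. The following is only a sketch of that variant, not part of the original answer; the helper names profile_matrix and all_consensus are made up for illustration, and it assumes all sequences have the same length and contain only A, C, G and T.

```python
from itertools import product

def profile_matrix(seqs, alphabet="ACGT"):
    """Count how often each nucleotide occurs at each position."""
    n = len(seqs[0])
    if any(len(s) != n for s in seqs):
        raise ValueError("all sequences must have the same length")
    profile = {nt: [0] * n for nt in alphabet}
    for seq in seqs:
        for i, nt in enumerate(seq):
            profile[nt][i] += 1
    return profile

def all_consensus(profile, alphabet="ACGT"):
    """Return every consensus sequence, expanding ties at each position."""
    n = len(profile[alphabet[0]])
    choices = []
    for i in range(n):
        best = max(profile[nt][i] for nt in alphabet)
        # Keep every nucleotide that reaches the highest count here
        choices.append([nt for nt in alphabet if profile[nt][i] == best])
    # One output sequence per combination of tied nucleotides
    return ["".join(combo) for combo in product(*choices)]

seqList = ['TTCAAGCT', 'TGGCAACT', 'TTGGATCT', 'TAGCAACC',
           'TTGGAACT', 'ATGCCATT', 'ATGGCACT']
profile = profile_matrix(seqList)
consensus_seqs = all_consensus(profile)
print(consensus_seqs)  # ['TTGCAACT', 'TTGGAACT']
```

The number of results is the product of the tie counts over all positions, so it can grow quickly when many positions are tied.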
