改进DNA比对解盖的代码设计 [英] Improving code design of DNA alignment degapping
问题描述
这是有关更有效的代码设计的问题:
This is a question regarding a more efficient code design:
假定代表两个基因(gene1和gene2)的三个对齐的DNA序列(seq1,seq2和seq3;它们都是字符串).相对于比对的DNA序列,这些基因的起始和终止位置是已知的.
Assume three aligned DNA sequences (seq1, seq2 and seq3; they are each strings) that represent two genes (gene1 and gene2). Start and stop positions of these genes are known relative to the aligned DNA sequences.
# Input
align = {"seq1":"ATGCATGC", # In seq1, gene1 and gene2 are of equal length
"seq2":"AT----GC",
"seq3":"A--CA--C"}
annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]},
"seq2":{"gene1":[0,3], "gene2":[4,7]},
"seq3":{"gene1":[0,3], "gene2":[4,7]}}
我希望消除比对中的缺口(即破折号),并保持基因起始和终止位置的相对关联.
I wish to remove the gaps (i.e., dashes) from the alignment and maintain the relative association of the start and stop positions of the genes.
# Desired output
align = {"seq1":"ATGCATGC",
"seq2":"ATGC",
"seq3":"ACAC"}
annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]},
"seq2":{"gene1":[0,1], "gene2":[2,3]},
"seq3":{"gene1":[0,1], "gene2":[2,3]}}
获得所需的输出看起来并不那么琐碎.下面,我针对此问题编写了一些(行编号)伪代码,但是肯定有更优雅的设计.
Obtaining the desired output is less trivial than it may seem. Below I wrote some (line-numbered) pseudocode for this problem, but surely there is a more elegant design.
1 measure length of any aligned gene # take any seq, since all seqs aligned
2 list_lengths = list of gene lengths # order is important
3 for seq in alignment
4 outseq = ""
5 for each num in range(0, length(seq)) # weird for-loop is intentional
6 if seq[num] == "-"
7 current_gene = gene whose start/stop positions include num
8 subtract 1 from length of current_gene
9 subtract 1 from lengths of all genes following current_gene in list_lengths
10 else
11 append seq[num] to outseq
12 append outseq to new variable
13 convert gene lengths into start/stop positions and append ordered to new variable
任何人都可以给我一些建议/示例,以进行更短,更直接的代码设计吗?
Can anyone give me suggestions/examples for a shorter, more direct code design?
推荐答案
此答案处理从注释到cdlanes答案的更新的annos
词典.该答案使annos
词典的seq2
gene2
索引错误[2,1].如果序列在该区域中包含所有间隔,则我建议的解决方案将从字典中删除gene
条目.还需要注意的是,如果一个基因在最后一个align
中仅包含一个字母,那么anno[geneX]
的起始和终止索引将相等->请从注释的annos
中参见seq3
gene1
.
This answer handles your updated annos
dictionary from the comment to cdlanes answer. That answer leaves the annos
dictionary with the incorrect index of [2,1] for seq2
gene2
. My proposed solution will remove the gene
entry from the dictionary if the sequence contains ALL gaps in that region. Also to note, if a gene contains only one letter in the final align
, then anno[geneX]
will have equal indices for start and stop --> See seq3
gene1
from your commented annos
.
align = {"seq1":"ATGCATGC",
"seq2":"AT----GC",
"seq3":"A--CA--C"}
annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]},
"seq2":{"gene1":[0,3], "gene2":[4,7]},
"seq3":{"gene1":[0,3], "gene2":[4,7]}}
annos3 = {"seq1":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]},
"seq2":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]},
"seq3":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}}
import re
for name,anno in annos.items():
# indices of gaps removed usinig re
removed = [(m.start(0)) for m in re.finditer(r'-', align[name])]
# removes gaps from align dictionary
align[name] = re.sub(r'-', '', align[name])
build_dna = ''
for gene,inds in anno.items():
start_ind = len(build_dna)+1
#generator to sum the num '-' removed from gene
num_gaps = sum(1 for i in removed if i >= inds[0] and i <= inds[1])
# build the de-gapped string
build_dna+= align[name][inds[0]:inds[1]+1].replace("-", "")
end_ind = len(build_dna)
if num_gaps == len(align[name][inds[0]:inds[1]+1]): #gene is all gaps
del annos[name][gene] #remove the gene entry
continue
#update the values in the annos dictionary
annos[name][gene][0] = start_ind-1
annos[name][gene][1] = end_ind-1
结果:
In [3]: annos
Out[3]: {'seq1': {'gene1': [0, 3], 'gene2': [4, 7]},
'seq2': {'gene1': [0, 1], 'gene2': [2, 3]},
'seq3': {'gene1': [0, 1], 'gene2': [2, 3]}}
来自上述3个基因annos
的结果.只需替换annos
变量:
Results from the 3 gene annos
above. Just replace the annos
variable:
In [5]: annos3
Out[5]: {'seq1': {'gene1': [0, 2], 'gene2': [3, 4], 'gene3': [5, 7]},
'seq2': {'gene1': [0, 1], 'gene3': [2, 3]},
'seq3': {'gene1': [0, 0], 'gene2': [1, 2], 'gene3': [3, 3]}}
这篇关于改进DNA比对解盖的代码设计的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!