Adding sequences of hyphens, of a specific length, into sequences using loops
Problem description
I have three text files:

>xx_oneFish |xxx
AAAAAAA
>xx_twoFish |xxx
CCCCCC
>xx_redFish |xxx
TTTTTT
>xx_blueFish |xxx
GGGGGG

>xx_oneFish |xxx
aaaa
>xx_twoFish |xxx
cccc

>xx_redFish |xxx
tt
>xx_blueFish |xxx
gg

I am trying to get an output in which the letter sequences for each species (redFish, blueFish, etc.) are collected in a list, in the same order as the files they come from appear in the directory where the sequences are stored. There will be one nested list per species.
If a file contains no sequence for a species, I want to add a string of hyphens that is the same length as the sequences the file contains for the other species.
That is, for this dataset the output should look like this:
[['--', 'aaaa', 'AAAAAAA'], ['--', 'cccc', 'CCCCCC'], [ 'tt', '----', 'TTTTTT'], ['gg', '----', 'GGGGGG']]
This is my current code:
differentNames = ['oneFish', 'twoFish', 'redFish', 'blueFish']
concatSeq = [[], [], [], []]
import os
testSequences = []
testNames = []
for filename in os.listdir("./"):  # go to directory where aligned files are kept
    if filename.endswith(".txt"):  # open files which have been aligned with MAFFT
        fastaFile = open(filename, 'r')
        temp_sub_list_names = []
        temp_sub_list_seq = []
        for line in fastaFile:
            line = line.strip()
            if line:
                if not line.startswith('>'):
                    temp_sub_list_seq.append(line)
                else:
                    temp_sub_list_names.append(line)
        testSequences.append(temp_sub_list_seq)
        testNames.append(temp_sub_list_names)
for i in range(len(testNames)):
    for k in range(len(testNames[i])):
        for j in range(len(differentNames)):
            if differentNames[j] in testNames[i][k]:  # check whether the sequence names match up
                concatSeq[j].append(testSequences[i][k])  # if they do, add the sequence to the corresponding list
c = 1
for a in range(len(concatSeq)):
    # for b in range(len(concatSeq[a]):
    if len(concatSeq[a]) < c:
        hyphenString = "-" * len(testSequences[c-1][0])
        concatSeq[a].append(hyphenString)
    c += 1
print(concatSeq)
Something is going wrong in the final loop, as this is my output:
[['aaaa', 'AAAAAAA'], ['----', 'cccc', 'CCCCCC'], ['----', 'tt', 'TTTTTT'], ['----', 'gg', 'GGGGGG']]
Recommended answer
If you don't mind using the re module to parse the files, you can use this example:
file_1 = '''>xx_oneFish |xxx
AAAAAAA
>xx_twoFish |xxx
CCCCCC
>xx_redFish |xxx
TTTTTT
>xx_blueFish |xxx
GGGGGG'''
file_2 = '''>xx_oneFish |xxx
aaaa
>xx_twoFish |xxx
cccc'''
file_3 = '''>xx_redFish |xxx
tt
>xx_blueFish |xxx
gg'''
import re
from collections import OrderedDict
f1 = OrderedDict(re.findall(r'>.*?_(.*?)\s.*?\n(.*?)(?=\n|\Z)', file_1, flags=re.DOTALL))
f2 = OrderedDict(re.findall(r'>.*?_(.*?)\s.*?\n(.*?)(?=\n|\Z)', file_2, flags=re.DOTALL))
f3 = OrderedDict(re.findall(r'>.*?_(.*?)\s.*?\n(.*?)(?=\n|\Z)', file_3, flags=re.DOTALL))
differentNames = {'oneFish', 'twoFish', 'redFish', 'blueFish'}
d = OrderedDict()
for i, f in enumerate([f1, f2, f3]):
    for k, v in f.items():
        if k in differentNames:  # <-- comment this out if you want to check for all species in files
            d.setdefault(k, []).append((i, v))
lengths = dict(i for v in d.values() for i in v)
vals = []
for k, v in d.items():
    dd = dict(v)
    vals.append([dd.get(i, '-' * len(lengths[i])) for i in range(len(lengths))][::-1])
from pprint import pprint
pprint(vals)
Prints:
[['--', 'aaaa', 'AAAAAAA'],
['--', 'cccc', 'CCCCCC'],
['tt', '----', 'TTTTTT'],
['gg', '----', 'GGGGGG']]
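
For readers who would rather stay with plain loops, as in the question, one possible approach is to decide the padding while each file is being processed instead of in a separate pass afterwards. The sketch below is not part of the original answer. The helper names read_fasta, read_directory and concat_with_gaps are hypothetical, and it assumes that every file is aligned, so the first sequence in a file gives the gap length for that file. It also sorts the filenames, because os.listdir returns entries in arbitrary order; the resulting column order therefore depends on your filenames.

```python
import os
from collections import OrderedDict


def read_fasta(path):
    """Map each '>' header line in a FASTA-style file to its sequence."""
    records = OrderedDict()
    header = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith('>'):
                header = line
                records[header] = ''
            elif header is not None:
                records[header] += line  # join sequences wrapped over several lines
    return records


def read_directory(directory):
    """Read every .txt file in directory, in sorted (deterministic) filename order."""
    names = sorted(name for name in os.listdir(directory) if name.endswith('.txt'))
    return [read_fasta(os.path.join(directory, name)) for name in names]


def concat_with_gaps(per_file_records, species_names):
    """Return one list per species, with one entry per file, in file order."""
    concat = [[] for _ in species_names]
    for records in per_file_records:
        # Assumption: the file is aligned, so its first sequence sets the gap length.
        first_seq = next(iter(records.values()), '')
        gap = '-' * len(first_seq)
        for idx, name in enumerate(species_names):
            # Same substring check as in the question: species name inside the header.
            found = next((seq for header, seq in records.items() if name in header), None)
            concat[idx].append(gap if found is None else found)
    return concat


# In-memory stand-ins for the three files, in the order that gives the expected output.
per_file = [
    OrderedDict([('>xx_redFish |xxx', 'tt'), ('>xx_blueFish |xxx', 'gg')]),
    OrderedDict([('>xx_oneFish |xxx', 'aaaa'), ('>xx_twoFish |xxx', 'cccc')]),
    OrderedDict([('>xx_oneFish |xxx', 'AAAAAAA'), ('>xx_twoFish |xxx', 'CCCCCC'),
                 ('>xx_redFish |xxx', 'TTTTTT'), ('>xx_blueFish |xxx', 'GGGGGG')]),
]
names = ['oneFish', 'twoFish', 'redFish', 'blueFish']
result = concat_with_gaps(per_file, names)
print(result)
```

With real files, per_file would come from read_directory("./") instead of the in-memory records. Because every species receives exactly one entry per file, the hyphen string lands in the position of the file that lacks the species, which the final loop in the question cannot guarantee since it pads only after all files have been read.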