Adding sequences of hyphens, of a specific length, into sequences using loops

Question

I have three text files:

>xx_oneFish |xxx
AAAAAAA
>xx_twoFish |xxx
CCCCCC
>xx_redFish |xxx
TTTTTT
>xx_blueFish |xxx
GGGGGG

>xx_oneFish |xxx
aaaa
>xx_twoFish |xxx
cccc

>xx_redFish |xxx
tt
>xx_blueFish |xxx
gg

I am trying to get an output in which the letter sequences for each species (redFish, blueFish, etc.) are collected in a list, in the same order as the files appear in the directory where the sequences are stored. There should be one nested list per species.

If a file contains no sequence for a species, I want to add a string of hyphens that is the same length as the sequences present in that file for the other species.

That is, for this dataset the output should look like this:

[['--', 'aaaa', 'AAAAAAA'], ['--', 'cccc', 'CCCCCC'], [ 'tt', '----', 'TTTTTT'], ['gg', '----', 'GGGGGG']]

This is my current code:

differentNames =  ['oneFish', 'twoFish', 'redFish', 'blueFish']
concatSeq = [[], [], [], []]

import os
testSequences = []
testNames = []
for filename in os.listdir("./"): #go to directory where aligned files are kept
    if filename.endswith(".txt"): #open files which have been aligned with MAFFT
        fastaFile = open(filename, 'r') 
        temp_sub_list_names = []
        temp_sub_list_seq = []
        for line in fastaFile:
            line = line.strip()
            if line:
                if not line.startswith('>'):
                    temp_sub_list_seq.append(line)
                else:
                    temp_sub_list_names.append(line)
        testSequences.append(temp_sub_list_seq)
        testNames.append(temp_sub_list_names)

for i in range(len(testNames)):
    for k in range(len(testNames[i])):
        for j in range(len(differentNames)):
            if differentNames[j] in testNames[i][k]: #check whether the sequence names match up
                concatSeq[j].append(testSequences[i][k]) #if they do, add the sequence to the corresponding list
        c = 1
        for a in range(len(concatSeq)):
        #   for b in range(len(concatSeq[a]):
            if len(concatSeq[a]) < c:
                hyphenString = "-" * len(testSequences[c-1][0])
                concatSeq[a].append(hyphenString)
        c+=1


print concatSeq

Something is going wrong in the final loop, because this is my output:

[['aaaa', 'AAAAAAA'], ['----', 'cccc', 'CCCCCC'], ['----', 'tt', 'TTTTTT'], ['----', 'gg', 'GGGGGG']]

Recommended answer

If you don't mind using the re module to parse the files, you can use this example:

file_1 = '''>xx_oneFish |xxx
AAAAAAA
>xx_twoFish |xxx
CCCCCC
>xx_redFish |xxx
TTTTTT
>xx_blueFish |xxx
GGGGGG'''

file_2 = '''>xx_oneFish |xxx
aaaa
>xx_twoFish |xxx
cccc'''

file_3 = '''>xx_redFish |xxx
tt
>xx_blueFish |xxx
gg'''

import re
from collections import OrderedDict

f1 = OrderedDict(re.findall(r'>.*?_(.*?)\s.*?\n(.*?)(?=\n|\Z)', file_1, flags=re.DOTALL))
f2 = OrderedDict(re.findall(r'>.*?_(.*?)\s.*?\n(.*?)(?=\n|\Z)', file_2, flags=re.DOTALL))
f3 = OrderedDict(re.findall(r'>.*?_(.*?)\s.*?\n(.*?)(?=\n|\Z)', file_3, flags=re.DOTALL))

differentNames = {'oneFish', 'twoFish', 'redFish', 'blueFish'}

d = OrderedDict()
for i, f in enumerate([f1, f2, f3]):
    for k, v in f.items():
        if k in differentNames: # <-- comment this out if you want to check for all species in files
            d.setdefault(k, []).append((i, v))

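# Map each file index to one sequence from that file; only its length is used below
lengths = dict(i for v in d.values() for i in v)

vals = []
for k, v in d.items():
    dd = dict(v)  # file index -> this species' sequence
    # Take the species' sequence if the file has one, otherwise hyphens matching
    # that file's sequence length; [::-1] reverses the file order to match the desired output
    vals.append([dd.get(i, '-' * len(lengths[i])) for i in range(len(lengths))][::-1])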

from pprint import pprint
pprint(vals)

Prints:

[['--', 'aaaa', 'AAAAAAA'],
 ['--', 'cccc', 'CCCCCC'],
 ['tt', '----', 'TTTTTT'],
 ['gg', '----', 'GGGGGG']]
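
For readers who would rather stay with plain loops, as in the question, here is a minimal sketch of the same idea without re. In the question's code, `c` is reset to 1 on every pass and the padding check runs after each record rather than after each file, so only lists that are still empty ever get padded. The sketch instead handles one file at a time and appends exactly one entry per species per file. The function names `parse_fasta` and `concat_with_gaps` are made up for illustration. It assumes that every sequence in a given file has the same length (the longest one is used as the gap length) and that a species name matches a header by simple substring test.

```python
def parse_fasta(text):
    """Return {header: sequence} for FASTA-style text, joining wrapped sequence lines."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            header = line
            records[header] = ''
        elif header is not None:
            records[header] += line
    return records


def concat_with_gaps(texts, names):
    """Build one list per name, with one entry per file, padding with hyphens."""
    result = [[] for _ in names]
    for text in texts:
        records = parse_fasta(text)
        # Assumption: sequences within one aligned file share a length,
        # so the longest sequence gives the gap length for this file.
        gap_len = max((len(seq) for seq in records.values()), default=0)
        for j, name in enumerate(names):
            # First record whose header contains the species name, if any
            seq = next((s for h, s in records.items() if name in h), None)
            result[j].append(seq if seq is not None else '-' * gap_len)
    return result
```

Calling `concat_with_gaps([file_3, file_2, file_1], ['oneFish', 'twoFish', 'redFish', 'blueFish'])` with the strings from the answer should give the nested list shown above. To work from a directory, read each .txt file's contents into the `texts` list in the order you want. `os.listdir` does not guarantee any particular order, so sorting the filenames first is probably wise.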
