Loop over IDs from two FASTA files

Question

I have two FASTA files, each containing multiple sequences:

cat file1.fasta
>1
ACGTCGAT
>2
ACTTTATT
>3
ACGGGG

cat file2.fasta
>1
CCGGAGC
>2
TGTCAGTC
>3
CTACGTCTT

I also have a list of IDs for each fasta file that I want to use to extract specific sequences by ID, make a 2 sequence fasta and then perform some operations (align, calc distance).

The lists:

cat file1.list
1
3
cat file2.list
2
1 

In reality these fasta files and lists are thousands of sequences/lines long

I am trying to loop over each line in the lists to extract the FASTA sequence that matches that particular ID/line, then combine the sequence from each file into a two-sequence FASTA file that can be aligned, etc. Basically, I want a pairwise alignment of each FASTA sequence with its "pair".

So based on the example here, and the order of the IDs in the lists, I want to pair sequence 1 from file1.fasta with sequence 2 from file2.fasta, then move on to the next pair (sequence 3 from file1.fasta and sequence 1 from file2.fasta), and so on. Extracting FASTA sequences by ID is relatively easy and there are a few ways to do it. One is faOneRecord, which takes the FASTA file you want to extract from and the record name/ID you want to find, and returns the header and sequence:

faOneRecord <in.fa> <recordName>

So, after the first loop, I would have this file created based on the id list:

>1
ACGTCGAT
>2
TGTCAGTC

and so on.

I would think this is relatively easy to do, but I can't seem to get there. In each iteration, once I have that two-sequence FASTA, I want to align it, get distance estimates, print them to a file, and move on to the next pair. The rest of that may take some work and requires specific programs, but I need help just producing the two-sequence FASTA extracted by looping over the IDs.

I guess the major question is how to loop over the ids and then pipe those IDs as arguments into the faOneRecord command

This might be too specific, and if so I apologize, but any ideas on how to get started would be helpful and much appreciated.

Recommended answer

Here's an (incomplete) sketch of a Python solution. As I said in the comment, there are two steps:

First, read both files into lists. If you are sure they look exactly like your example (IDs are the consecutive integers 1, 2, 3, ... and each sequence sits on a single line), you can just ignore the >x lines:

fasta1 = ['']  # placeholder so the first sequence is stored at fasta1[1], not fasta1[0]
for line in open('file1.fasta'):
    if not line.startswith('>'):
        fasta1.append(line.strip())

The for line in open() just opens the file and iterates over its lines.
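
If the files are not guaranteed to look exactly like the example (for instance, the IDs are not consecutive integers, or sequences are wrapped over several lines), a dictionary keyed by the header may be a safer variant of the same idea. This is only a minimal sketch: read_fasta is a hypothetical helper name, and it assumes the ID is the first whitespace-separated token after the > character.

```python
def read_fasta(path):
    """Return a dict mapping each record ID to its full sequence."""
    records = {}
    current_id = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith('>'):
                # Assumption: the ID is the first token of the header line
                parts = line[1:].split()
                current_id = parts[0] if parts else ''
                records[current_id] = ''
            elif current_id is not None:
                # Sequences wrapped over several lines are joined together
                records[current_id] += line
    return records
```

With this variant the lookups would use the ID as a string (for example records['3']) rather than an integer list index.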

Do the same for file2, storing the result in fasta2. Then you can read the two list files in parallel, convert each line to a number, and print the matching sequences:

for l1, l2 in zip(open('file1.list'), open('file2.list')):
    print(fasta1[int(l1)])  # sequence for the ID on this line of file1.list
    print(fasta2[int(l2)])  # sequence for the paired ID from file2.list

zip takes the two files and reads them in parallel, so that the first time the loop is executed, l1 and l2 contain the first line of file1.list and file2.list, respectively; the second time, they contain the second line of each, and so on.
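
Putting the two steps together, the following sketch writes one two-sequence FASTA file per pair of IDs, which could then be handed to an aligner. It is an illustration under stated assumptions rather than a finished solution: the helper names load_sequences and write_pairs and the output names pair_1.fasta, pair_2.fasta, ... are made up for this example, and the header text after > is used as the ID.

```python
def load_sequences(path):
    """Map each FASTA header (text after '>') to its sequence."""
    seqs, current = {}, None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith('>'):
                current = line[1:]
                seqs[current] = ''
            elif line and current is not None:
                seqs[current] += line  # join wrapped sequence lines
    return seqs


def write_pairs(fasta1, fasta2, list1, list2, prefix='pair_'):
    """Write one two-record FASTA file per ID pair; return the file names."""
    seqs1, seqs2 = load_sequences(fasta1), load_sequences(fasta2)
    written = []
    with open(list1) as ids1, open(list2) as ids2:
        # zip pairs line n of list1 with line n of list2
        for n, (id1, id2) in enumerate(zip(ids1, ids2), start=1):
            id1, id2 = id1.strip(), id2.strip()
            if not id1 or not id2:
                continue  # skip blank lines in the lists
            out_name = f'{prefix}{n}.fasta'  # hypothetical naming scheme
            with open(out_name, 'w') as out:
                out.write(f'>{id1}\n{seqs1[id1]}\n')
                out.write(f'>{id2}\n{seqs2[id2]}\n')
            written.append(out_name)
    return written
```

Calling write_pairs('file1.fasta', 'file2.fasta', 'file1.list', 'file2.list') on the example files should give a pair_1.fasta holding record 1 from file1.fasta and record 2 from file2.fasta, which is the file the question asks for after the first loop. An ID missing from a FASTA file raises a KeyError, which makes mismatched lists easy to spot.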
