Remove multiple sequences from fasta file
Question
I have a text file of character sequences in which each record consists of two lines: a header, and the sequence itself on the following line. The file is structured as follows:
>header1
aaaaaaaaa
>header2
bbbbbbbbbbb
>header3
aaabbbaaaa
[...]
>headerN
aaabbaabaa
In another file I have a list of the headers of the sequences that I would like to remove, like this:
>header1
>header5
>header12
[...]
>header145
The idea is to remove these sequences from the first file, that is, each of these headers plus the line that follows it. I did it using sed, like this:
while read line; do sed -i "/$line/,+1d" first_file.txt; done < second_file.txt
It works, but it takes quite long because sed reads and rewrites the whole file once for every header in the list, and the file is quite big. Any idea how I could speed this up?
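Recommended answer
Create a sed script of delete commands from the second file. Each pattern is anchored with ^ and $ so that >header1 does not also match >header12 or >header145 (the ,+1 address is a GNU sed extension, as in the question):
sed 's#.*#/^&$/,+1d#' secondFile.txt > commands.sed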
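To see what such a generated script looks like, here is a minimal sketch that builds anchored delete commands from a made-up two-line list. The helper name `make_delete_script` and the sample data are assumptions for illustration only, and the sketch assumes headers contain no regex metacharacters or `/`, which would need escaping first.

```shell
# Hypothetical sample list of headers to remove (a temp file, for illustration only).
list=$(mktemp)
printf '%s\n' '>header1' '>header5' > "$list"

# Turn every line of the list into a sed command of the form /^LINE$/,+1d.
# In the replacement, & stands for the whole matched line; ^ and $ anchor it
# so that >header1 cannot match >header12.
make_delete_script() {
    sed 's#.*#/^&$/,+1d#' "$1"
}

make_delete_script "$list"
# prints:
# /^>header1$/,+1d
# /^>header5$/,+1d

rm -f "$list"
```

Redirecting that output to a file gives a script that sed can apply in one pass with -f.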
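Then apply that script to the first file. sed now reads the file only once; the result goes to standard output, so redirect it to a new file or add -i to edit in place: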
sed -f commands.sed firstFile.txt
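As an alternative that is not part of the original answer, the same one-pass removal can be sketched with awk, which compares header lines as exact strings and so avoids regex escaping altogether. The function name `remove_listed` and the sample files are hypothetical, and the sketch assumes the strict two-line record layout described in the question.

```shell
# remove_listed LIST FASTA
# Print FASTA without the records whose header line appears in LIST.
# Assumes every record is exactly two lines: a header, then its sequence.
remove_listed() {
    awk -v list="$1" '
        # Load the headers to drop as array keys before reading the FASTA file.
        BEGIN { while ((getline line < list) > 0) drop[line] }
        # The line after a dropped header is its sequence: drop it too.
        skip       { skip = 0; next }
        # Exact string match on the header line, not a regex match.
        $0 in drop { skip = 1; next }
        { print }
    ' "$2"
}

# Made-up sample data for illustration.
fasta=$(mktemp)
list=$(mktemp)
printf '%s\n' '>header1' 'aaaaaaaaa' '>header2' 'bbbbbbbbbbb' \
              '>header12' 'aaabbbaaaa' > "$fasta"
printf '%s\n' '>header1' > "$list"

remove_listed "$list" "$fasta"
# prints the >header2 and >header12 records; only >header1 is removed

rm -f "$fasta" "$list"
```

Because the headers are stored as array keys, each line of the large file costs one lookup regardless of how long the removal list is, whereas a generated sed script tests every line against every pattern.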