Remove multiple sequences from a FASTA file

Question

I have a text file of character sequences in which each record consists of two lines: a header, followed by the sequence itself on the next line. The file is structured as follows:

>header1
aaaaaaaaa
>header2
bbbbbbbbbbb
>header3
aaabbbaaaa
[...]
>headerN
aaabbaabaa

In another file I have a list of the headers of the sequences that I would like to remove, like this:

>header1
>header5
>header12
[...]
>header145

The idea is to remove these sequences from the first file, that is, each listed header plus the line that follows it. I did it with sed, as follows:

while read line; do sed -i "/$line/,+1d" first_file.txt; done < second_file.txt

It works, but it takes quite long because sed reads and rewrites the whole file once for every header in the list, and the file is quite big. Any ideas on how I could speed this up?

Recommended answer

Create a sed script containing one delete command per header in the second file:

# Anchor each pattern with ^ and $ so that >header1 does not also match >header12
sed 's#\(.*\)#/^\1$/,+1d#' secondFile.txt > commands.sed

Then apply that script to the first file in a single pass (the result is written to standard output):

sed -f commands.sed firstFile.txt 
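
Two caveats apply to the sed approach: the `addr,+1` address form is a GNU sed extension, and a sed address such as `/>header1/` is a regular expression that matches anywhere in the line, so it would also hit `>header12` unless it is anchored. As a possible alternative (not part of the original answer), the sketch below uses awk to compare headers as exact strings through an array lookup, reading each file once. The `demo_*` file names and the sample records are hypothetical, and the sketch assumes the list of headers is not empty.

```shell
# Hypothetical sample data (file names are made up for this sketch).
printf '%s\n' '>header1' 'aaaaaaaaa' '>header2' 'bbbbbbbbbbb' \
              '>header12' 'aaabbbaaaa' > demo_first.txt
printf '%s\n' '>header1' > demo_remove.txt

# First file on the command line (NR == FNR): remember each header to drop.
# Second file: on every header line, decide whether the record is skipped;
# all lines of a record that is not skipped are printed.
awk 'NR == FNR { drop[$0]; next }
     /^>/      { skip = ($0 in drop) }
     !skip' demo_remove.txt demo_first.txt > demo_filtered.txt

cat demo_filtered.txt
```

Because the decision is made on each header line and holds until the next header, this also works when a sequence spans more than one line. Only `>header1` and its sequence are removed here; `>header12` is kept because the comparison is on the whole line.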
