Printing a sequence from a FASTA file

Question

I often need to find a particular sequence in a FASTA file and print it. For those who don't know, FASTA is a text file format for biological sequences (DNA, proteins, etc.). It's pretty simple: a line with the sequence name preceded by a '>', followed by the sequence itself on all the lines up to the next '>'. For example:

>sequence1
ACTGACTGACTGACTG
>sequence2
ACTGACTGACTGACTG
ACTGACTGACTGACTG
>sequence3
ACTGACTGACTGACTG

I currently get the sequence I need by using grep with -A, so I'll run

grep -A 10 sequence_name filename.fa

and then if I don't see the start of the next sequence in the file, I'll change the 10 to 20 and repeat until I'm sure I'm getting the whole sequence.

It seems like there should be a better way to do this. For example, can I ask it to print up until the next '>' character?

Recommended answer

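Use > as the record separator. awk then reads everything from one '>' to the next as a single record, so $1 is the sequence name, and printing RS in front of $0 puts back the '>' that was consumed as the separator: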

awk -v seq="sequence2" -v RS='>' '$1 == seq {print RS $0}' file

For the example file above, this prints:

>sequence2
ACTGACTGACTGACTG
ACTGACTGACTGACTG
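
A line-oriented variant of the same idea may be handy when a header description could itself contain a '>' character, which would split the record under RS='>'. The sketch below is not part of the original answer: the function name fasta_get is hypothetical, and it assumes the sequence name is the first word on the header line, directly after the '>'. It switches a flag on at the matching header and off at the next header, printing every line in between.

```shell
# fasta_get NAME FILE: print the record whose header's first word is NAME.
# (fasta_get is a hypothetical helper name, not part of any toolkit.)
fasta_get() {
    awk -v seq="$1" '
        /^>/ { keep = (substr($1, 2) == seq) }  # header line: switch printing on or off
        keep                                    # print header and sequence lines while on
    ' "$2"
}
```

Usage would look like `fasta_get sequence2 filename.fa`. Because it only tests lines that start with '>', it needs no guess about how many lines the sequence spans, and it does not add a trailing blank line after the record.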
