Printing a sequence from a fasta file

Question

I often need to find a particular sequence in a fasta file and print it. For those who don't know, fasta is a text file format for biological sequences (DNA, proteins, etc.). It's pretty simple: each record has a header line with the sequence name preceded by a '>', and all the lines that follow, up to the next '>', are the sequence itself. For example:

>sequence1
ACTGACTGACTGACTG
>sequence2
ACTGACTGACTGACTG
ACTGACTGACTGACTG
>sequence3
ACTGACTGACTGACTG

The way I currently get the sequence I need is to use grep with -A, so I run:

grep -A 10 sequence_name filename.fa

Then, if I don't see the start of the next sequence in the output, I change the 10 to 20 and repeat until I'm sure I'm getting the whole sequence.

It seems like there should be a better way to do this. For example, can I ask it to print up until the next '>' character?

Recommended answer

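Use '>' as the record separator, so that awk reads each fasta entry as a single record whose first field is the sequence name: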

awk -v seq="sequence2" -v RS='>' '$1 == seq {print RS $0}' file

>sequence2
ACTGACTGACTGACTG
ACTGACTGACTGACTG
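
If you would rather do literally what the question asks (print from the matching header up to the next '>'), a line-by-line variant is one option. This is only a sketch: sample.fa and the print_seq helper are made-up names for illustration, and it assumes the sequence name is the first whitespace-delimited word right after the '>'.

```shell
# Hypothetical sample file for illustration (same records as in the question).
cat > sample.fa <<'EOF'
>sequence1
ACTGACTGACTGACTG
>sequence2
ACTGACTGACTGACTG
ACTGACTGACTGACTG
>sequence3
ACTGACTGACTGACTG
EOF

# print_seq NAME FILE: print the record whose header name is exactly NAME.
# (print_seq is a made-up helper name, not a standard tool.)
print_seq() {
  awk -v seq="$1" '
    # Header line: switch printing on only if the name after ">" matches.
    /^>/ { keep = (substr($1, 2) == seq) }
    # Print every line while we are inside the wanted record.
    keep
  ' "$2"
}

print_seq sequence2 sample.fa
```

Because the name is compared exactly, asking for "sequence" matches nothing here, whereas grep would match all three headers. Unlike the RS='>' one-liner, this version only treats a '>' at the start of a line as a record boundary, so a '>' inside a header description should not split the record, and no extra blank line is printed after the sequence.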
