Sequence length of a FASTA file
Question
I have the following FASTA file:
>header1
CGCTCTCTCCATCTCTCTACCCTCTCCCTCTCTCTCGGATAGCTAGCTCTTCTTCCTCCT
TCCTCCGTTTGGATCAGACGAGAGGGTATGTAGTGGTGCACCACGAGTTGGTGAAGC
>header2
GGT
>header3
TTATGAT
Desired output:
>header1
117
>header2
3
>header3
7
# 3 sequences, total length 127.
This is my code:
awk '/^>/ {print; next; } { seqlen = length($0); print seqlen}' file.fa
The output I get with this code is:
>header1
60
57
>header2
3
>header3
7
So I need a small modification to handle sequences that span multiple lines. I also need a way to get the total number of sequences and the total length. Any suggestions are welcome, in bash or awk please. I know this is easy to do in Perl/BioPerl, and I actually have a script that does it that way.
Thanks.
Recommended answer
An awk / gawk solution can be composed of three stages:
- Every time a header is found, these actions should be performed:
  - Print the previous seqlen, if it exists.
  - Print the tag.
  - Initialize seqlen.
- For each sequence line, add its length to seqlen.
- In the END block, print the remaining seqlen, if it exists.
Commented code:
awk '/^>/ { # header pattern detected
if (seqlen){
# print previous seqlen if exists
print seqlen
}
# print the tag
print
# initialize sequence
seqlen = 0
# skip further processing
next
}
# accumulate sequence length
{
seqlen += length($0)
}
# remnant seqlen if exists
END{if(seqlen){print seqlen}}' file.fa
A one-liner:
awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' file.fa
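As a quick sanity check, the same logic can be run against the sample from the question. The sketch below wraps the one-liner in a shell function and feeds it the input on stdin; the function name per_record_lengths is only an illustrative choice, not something from the original answer.

```shell
# per_record_lengths: hypothetical wrapper around the one-liner's logic.
# With no file arguments, awk reads from standard input.
per_record_lengths() {
    awk '/^>/ { if (seqlen) print seqlen; print; seqlen = 0; next }
         { seqlen += length($0) }
         END { if (seqlen) print seqlen }' "$@"
}

# Feed the question's sample input through the function.
per_record_lengths <<'EOF'
>header1
CGCTCTCTCCATCTCTCTACCCTCTCCCTCTCTCTCGGATAGCTAGCTCTTCTTCCTCCT
TCCTCCGTTTGGATCAGACGAGAGGGTATGTAGTGGTGCACCACGAGTTGGTGAAGC
>header2
GGT
>header3
TTATGAT
EOF
```

For this input the two lines of header1 (60 and 57 characters) are summed, so the lengths printed under the three headers should be 117, 3 and 7.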
For the totals:
awk '/^>/ { if (seqlen) {
print seqlen
}
print
seqtotal+=seqlen
seqlen=0
seq+=1
next
}
{
seqlen += length($0)
}
END{print seqlen
print seq" sequences, total length " seqtotal+seqlen
}' file.fa
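Two details of the script above may matter in practice: the `if (seqlen)` test skips a record whose sequence is empty, and the summary line lacks the leading "# " and trailing period shown in the desired output. A possible variant is sketched below. The function name fasta_lengths is hypothetical, and stripping a trailing carriage return is an added assumption about files with Windows line endings.

```shell
# fasta_lengths: hypothetical helper, a variant of the totals script above.
fasta_lengths() {
    awk '
        { sub(/\r$/, "") }            # assumption: tolerate CRLF line endings
        /^>/ {
            if (seq) print seqlen     # flush previous record, even if its length is 0
            print                     # print the header line
            seqtotal += seqlen        # add the finished record to the running total
            seqlen = 0
            seq++                     # count this record
            next
        }
        { seqlen += length($0) }      # accumulate the current record length
        END {
            if (seq) print seqlen     # flush the last record
            printf "# %d sequences, total length %d.\n", seq, seqtotal + seqlen
        }
    ' "$@"
}

# Usage: fasta_lengths file.fa
```

Here seq doubles as a "have we seen a header yet" flag, which is why the flush is conditioned on seq rather than on seqlen.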