Sequence length of FASTA file


Question

I have the following FASTA file:

>header1
CGCTCTCTCCATCTCTCTACCCTCTCCCTCTCTCTCGGATAGCTAGCTCTTCTTCCTCCT
TCCTCCGTTTGGATCAGACGAGAGGGTATGTAGTGGTGCACCACGAGTTGGTGAAGC
>header2
GGT
>header3
TTATGAT

Desired output:

>header1
117
>header2
3
>header3
7
# 3 sequences, total length 127.

This is my code:

awk '/^>/ {print; next; } { seqlen = length($0); print seqlen}' file.fa

The output I get with this code is:

>header1
60
57
>header2
3
>header3
7

So I need a "little" modification to handle sequences that span multiple lines. I also need a way to report the number of sequences and the total length. Any suggestions are welcome, in bash or awk please. I know this is easy to do in Perl/BioPerl, and I actually have a script that does it that way.

Thanks.

Recommended answer

An awk / gawk solution can be composed of three stages:


  1. Every time a header is found, these actions should be performed:

  • Print the previous seqlen, if it exists.
  • Print the header line.
  • Initialize seqlen.

  2. On every other line, add the line length to seqlen.

  3. At the end of the input, print the remaining seqlen, if any.

Commented code:

awk '/^>/ { # header pattern detected
        if (seqlen) {
            # print the previous seqlen if it exists
            print seqlen
        }

        # print the header line
        print

        # reset the sequence length
        seqlen = 0

        # skip further processing of this line
        next
    }

    # accumulate sequence length
    {
        seqlen += length($0)
    }

    # print the remaining seqlen, if any
    END { if (seqlen) { print seqlen } }' file.fa
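
One caveat with the `if (seqlen)` test: it treats a length of zero as "nothing to print", so a header with no sequence lines gets no length line at all. A minimal sketch of a variant that prints `0` in that case is below. The `seen` flag and the `fasta_lengths` function name are my own additions for illustration, not part of the original answer.

```shell
# Hypothetical wrapper: prints each header followed by its sequence length.
fasta_lengths() {
    awk '
    /^>/ {
        if (seen) print seqlen   # length of the previous record, even if 0
        print                    # the header line itself
        seqlen = 0
        seen = 1                 # at least one header has been read
        next
    }
    { seqlen += length($0) }     # sequence line: accumulate its length
    END { if (seen) print seqlen }
    ' "$@"
}
```

It can be called as `fasta_lengths file.fa`, or fed from a pipe with no arguments.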

As a one-liner:

awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' file.fa

With totals:

awk '/^>/ { if (seqlen) {
              print seqlen
              }
            print

            seqtotal+=seqlen
            seqlen=0
            seq+=1
            next
            }
    {
    seqlen += length($0)
    }     
    END{print seqlen
        print seq " sequences, total length " (seqtotal + seqlen)
    }' file.fa
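
To match the summary line in the desired output exactly (the leading `# ` and the trailing period), the totals logic could be wrapped as in the sketch below. The function name `fasta_stats` and the variables `n` and `total` are hypothetical names chosen here; the approach is the same header/accumulate/END structure as above, with `printf` used for the summary.

```shell
# Hypothetical wrapper: per-record lengths plus a summary line.
fasta_stats() {
    awk '
    /^>/ {
        if (n) print seqlen      # close the previous record
        print                    # the header line itself
        total += seqlen          # add the finished record to the total
        seqlen = 0
        n++                      # count records by their headers
        next
    }
    { seqlen += length($0) }     # sequence line: accumulate its length
    END {
        if (n) print seqlen      # length of the last record
        printf "# %d sequences, total length %d.\n", n, total + seqlen
    }
    ' "$@"
}
```

Run as `fasta_stats file.fa` on the sample from the question, this should give 117, 3 and 7 for the three headers, followed by `# 3 sequences, total length 127.`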
