Remove line breaks in a FASTA file

Question

I have a fasta file where the sequences are broken up with newlines. I'd like to remove the newlines. Here's an example of my file:

>accession1
ATGGCCCATG
GGATCCTAGC
>accession2
GATATCCATG
AAACGGCTTA

I'd like to convert it into this:

>accession1
ATGGCCCATGGGATCCTAGC
>accession2
GATATCCATGAAACGGCTTA

I found a potential solution on this site, which looks like this:

cat input.fasta | awk '{if (substr($0,1,1)==">"){if (p){print "\n";} print $0} else printf("%s",$0);p++;}END{print "\n"}' > joinedlineoutput.fasta

However, this places an extra line break between each entry, so the file looks like this:

>accession1
ATGGCCCATGGGATCCTAGC

>accession2
GATATCCATGAAACGGCTTA

I'm an awk noob, but I took a shot at modifying the command. My guess was the if (p){print "\n";} was the culprit...potentially print "\n" is adding two line breaks. I couldn't figure out how to add just one newline...this is probably something easy, but like I said, I'm a noob. Here was my (unsuccessful) solution:

awk '{if (substr($0,1,1)==">"){print "\n"$0} else printf("%s",$0);p++;}END{print "\n"}' input.fasta > joinedoutput.fasta

However, this adds an empty line at the beginning of the file because it's always printing a new line before it prints the first accession number:

{empty line} 
>accession1
ATGGCCCATGGGATCCTAGC
>accession2
GATATCCATGAAACGGCTTA

Anyone have a solution to get my file in the correct format? Thanks!

Recommended answer

This awk program:

% awk '!/^>/ { printf "%s", $0; n = "\n" } 
/^>/ { print n $0; n = "" }
END { printf "%s", n }
' input.fasta

will produce:

>accession1
ATGGCCCATGGGATCCTAGC
>accession2
GATATCCATGAAACGGCTTA

Explanation:

On lines that don't start with a >, print the line without a line break and store a newline character (in variable n) for later.

On lines that do start with a >, print the stored newline character (if any) and the line. Reset n, in case this is the last line.

End with a newline, if required.
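
As for the guess in the question, it looks correct: print always appends the output record separator (a newline by default), so print "\n" emits two newlines, while printf "\n" emits exactly one. The sketch below is an illustration rather than part of the answer above. It recreates the sample input in a temporary file (the names tmp, two, one and fixed are placeholders chosen here), counts the newlines each statement produces, and then runs the question's original command with print "\n" swapped for printf "\n".

```shell
# Sample input from the question, written to a temporary file.
tmp=$(mktemp)
printf '%s\n' '>accession1' 'ATGGCCCATG' 'GGATCCTAGC' \
              '>accession2' 'GATATCCATG' 'AAACGGCTTA' > "$tmp"

# print appends ORS (a newline by default), so print "\n" emits two newlines;
# printf "\n" emits exactly one.
two=$(awk 'BEGIN { print "\n" }' | wc -l | tr -d ' ')
one=$(awk 'BEGIN { printf "\n" }' | wc -l | tr -d ' ')
echo "print: $two newline(s), printf: $one newline(s)"

# The command from the question with print "\n" replaced by printf "\n".
fixed=$(awk '{
    if (substr($0, 1, 1) == ">") {
        if (p) printf "\n"      # end the previous sequence line
        print $0                # header line; print adds the newline
    } else {
        printf "%s", $0         # sequence chunk, no newline
    }
    p++
} END { printf "\n" }' "$tmp")

echo "$fixed"
rm -f "$tmp"
```

With the sample input this should print each header followed by its sequence on a single line, with no blank lines in between.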

By default, variables are initialized to the empty string. There is no need to explicitly "initialize" a variable in awk, which is what you would do in C and in most other traditional languages.

-- 6.1.3.1 Using Variables in a Program, The GNU Awk User's Guide (http://www.gnu.org/software/gawk/manual/gawk.html)
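
To see why the program never needs to set n before the first header, here is a small check, offered as a sketch and not taken from the answer itself. It shows that an unassigned awk variable behaves as the empty string in string context (and as 0 in numeric context), so print n $0 on the very first header prints just the header. The shell variable names as_string, as_number and first are placeholders chosen for this example.

```shell
# An awk variable that was never assigned acts as "" in string context
# and as 0 in numeric context.
as_string=$(awk 'BEGIN { printf "[%s]", n }')
as_number=$(awk 'BEGIN { print n + 0 }')
echo "string: $as_string, number: $as_number"

# On the first header, n is still empty, so no leading blank line appears.
first=$(printf '>a\n' | awk '/^>/ { print n $0 }')
echo "$first"
```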
