Remove line breaks in a FASTA file
Question
I have a FASTA file where the sequences are broken up with newlines. I'd like to remove the newlines. Here's an example of my file:
>accession1
ATGGCCCATG
GGATCCTAGC
>accession2
GATATCCATG
AAACGGCTTA
I'd like to convert it into this:
>accession1
ATGGCCCATGGGATCCTAGC
>accession2
GATATCCATGAAACGGCTTA
I found a potential solution on this site, which looks like this:
cat input.fasta | awk '{if (substr($0,1,1)==">"){if (p){print "\n";} print $0} else printf("%s",$0);p++;}END{print "\n"}' > joinedlineoutput.fasta
However, this places an extra line break between each entry, so the file looks like this:
>accession1
ATGGCCCATGGGATCCTAGC
{empty line}
>accession2
GATATCCATGAAACGGCTTA
I'm an awk noob, but I took a shot at modifying the command. My guess was that if (p){print "\n";} was the culprit: print "\n" is potentially adding two line breaks. I couldn't figure out how to add just one newline. This is probably something easy, but like I said, I'm a noob. Here was my (unsuccessful) solution:
awk '{if (substr($0,1,1)==">"){print "\n"$0} else printf("%s",$0);p++;}END{print "\n"}' input.fasta > joinedoutput.fasta
However, this adds an empty line at the beginning of the file, because it always prints a newline before each accession number, including the first one:
{empty line}
>accession1
ATGGCCCATGGGATCCTAGC
>accession2
GATATCCATGAAACGGCTTA
Does anyone have a solution to get my file into the correct format? Thanks!
Recommended answer
This awk program:
% awk '!/^>/ { printf "%s", $0; n = "\n" }
/^>/ { print n $0; n = "" }
END { printf "%s", n }
' input.fasta
will produce:
>accession1
ATGGCCCATGGGATCCTAGC
>accession2
GATATCCATGAAACGGCTTA
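To reproduce this output end to end, here is a minimal self-contained sketch. It recreates the sample input from the question and runs the same three rules with a comment on each. The output file name joined.fasta is only a placeholder chosen for this example.

```shell
# Recreate the sample input from the question.
cat > input.fasta <<'EOF'
>accession1
ATGGCCCATG
GGATCCTAGC
>accession2
GATATCCATG
AAACGGCTTA
EOF

# Same three rules as the program above; joined.fasta is an assumed output name.
awk '
  !/^>/ { printf "%s", $0; n = "\n" }   # sequence line: print without a newline, remember that one is owed
  /^>/  { print n $0; n = "" }          # header line: close the pending sequence (if any), then print the header
  END   { printf "%s", n }              # close the last sequence line if one is still pending
' input.fasta > joined.fasta

cat joined.fasta
```

Because n is cleared after every header, a header that is immediately followed by another header does not pick up a stray blank line.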
Explanation:
On lines that don't start with a >, print the line without a line break and store a newline character (in variable n) for later.
On lines that do start with a >, print the stored newline character (if any) and the line. Reset n, in case this is the last line (so that END does not emit an extra newline).
End with a newline, if required.
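The difference between print and printf that these rules rely on also explains the doubled line breaks in the question. print appends the output record separator (a newline by default) to whatever it is given, so print "\n" writes two newlines, whereas printf writes only what its format produces. A quick check, counting the bytes each command emits with wc -c:

```shell
# print appends the output record separator (ORS, a newline by default),
# so print "\n" writes two newline characters.
awk 'BEGIN { print "\n" }' | wc -c

# printf writes only what the format produces: a single newline here.
awk 'BEGIN { printf "\n" }' | wc -c

# print with an empty string is another way to get exactly one newline.
awk 'BEGIN { print "" }' | wc -c
```

The three counts should come out as 2, 1 and 1 (some wc implementations pad the number with spaces).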
By default, variables are initialized to the empty string. There is no need to explicitly "initialize" a variable in awk, which is what you would do in C and in most other traditional languages.
-- 6.1.3.1 Using Variables in a Program, The GNU Awk User's Guide (http://www.gnu.org/software/gawk/manual/gawk.html)
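The quoted behaviour is easy to confirm with a one-line sketch. The variable name n simply mirrors the program above and is never assigned here, so it prints as the empty string, has length 0 and counts as 0 in arithmetic. That is why the first header line is not preceded by a newline.

```shell
# n is never assigned, so it is the empty string (and 0 when used as a number).
awk 'BEGIN { printf "[%s] length=%d numeric=%d\n", n, length(n), n + 0 }'
```

This should print [] length=0 numeric=0.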