How to merge two fasta files and remove the duplicate information?
Problem description
I want to merge two FASTA files and remove the duplicate information.
Here are some example records:
>Symbiotaphrina_buchneri|DQ248313|SH1641879.08FU|reps|k__Fungi;p__Ascomycota;c__Xylonomycetes;o__Symbiotaphrinales;f__Symbiotaphrinaceae;g__Symbiotaphrina;s__Symbiotaphrina_buchneri
ACGATTTTGACCCTTCGGGGTCGATCTCCAACCCTTTGTCTACCTTCCTTGTTGCTTTGGCGGGCCGATGTTCGTTCTCGCGAACGACACCGCTGGCCTGACGGCTGGTGCGCGCCCGCCAGAGTCCACCAAAACTCTGATTCAAACCTACAGTCTGAGTATATATTATATTAAAACTTTCAACAACGGATCTCTTGGTTCTGGCATCGATGAAGAACGCAGCGAAATGCGATAAGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACATTGCGCCCCTTGGTATTCCGAGGGGCATGCCTGTTCGAGCGTCATTTCACCACTCAAGCTCAGCTTGGTATTGGGTCATCGTCTGGTCACACAGGCGTGCCTGAAAATCAGTGGCGGTGCCCATCCGGCTTCAAGCATAGTAATTTCTATCTTGCTTTGGAAGTCTCCGGAGGGTTACACCGGCCAACAACCCCAATTTTCTATG
>Dactylonectria_anthuriicola|JF735302|SH1546329.08FU|refs|k__Fungi;p__Ascomycota;c__Sordariomycetes;o__Hypocreales;f__Nectriaceae;g__Dactylonectria;s__Dactylonectria_anthuriicola
CCGAGTTTTCAACTCCCAAACCCCTGTGAACATACCATTTTGTTGCCTCGGCGGTGCCTGTTCCGACAGCCCGCCAGAGGACCCCAAACCCAAATTTCCTTGAGTGAGTCTTCTGAGTAACCGATTAAATAAATCAAAACTTTCAACAACGGATCTCTTGGTTCTGGCATCGATGAAGAACGCAGCGAAATGCGATAAGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACATTGCGCCCGCCAGTATTCTGGCGGGCATGCCTGTTCGAGCGTCATTTCAACCCTCAAGCCCCCGGGCTTGGTGTTGGGGATCGGCGAGCCTCTGCGCCCGCCGTCCCCTAAATTGAGTGGCGGTCACGTTGTAACTTCCTCTGCGTAGTAGCACACTTAGCACTGGGAAACAGCGCGGCCACGCCGTAAAACCCCCAACTTTGAACG
>Ilyonectria_robusta|JF735264|SH1546327.08FU|refs|k__Fungi;p__Ascomycota;c__Sordariomycetes;o__Hypocreales;f__Nectriaceae;g__Ilyonectria;s__Ilyonectria_robusta
CCGAGTTTACAACTCCCAAACCCCTGTGAACATACCATATTGTTGCCTCGGCGGTGTCTGTTTCGGCAGCCCGCCAGAGGACCCAAACCCTAGATTACATTAAAGCATTTTCTGAGTCAATGATTAAATCAATCAAAACTTTCAACAACGGATCTCTTGGTTCTGGCATCGATGAAGAACGCAGCGAAATGCGATAAGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACATTGCGCCCGCCAGTATTCTGGCGGGCATGCCTGTCCGAGCGTCATTTCAACCCTCAAGCCCCCGGGCTTGGTGTTGGAGATCGGCGAGCCCCCCGGGGCGCGCCGTCTCCCAAATATAGTGGCGGTCCCGCTGTAGCTTCCTCTGCGTAGTAGCACACCTCGCACTGGGAAACAGCGTGGCCACGCCGTAAAACCCCCCACTTCTGAAAG
>Symbiotaphrina_buchneri|DQ248313|SH1641879.08FU|reps|k__Fungi;p__Ascomycota;c__Xylonomycetes;o__Symbiotaphrinales;f__Symbiotaphrinaceae;g__Symbiotaphrina;s__Symbiotaphrina_buchneri
ACGATTTTGACCCTTCGGGGTCGATCTCCAACCCTTTGTCTACCTTCCTTGTTGCTTTGGCGGGCCGATGTTCGTTCTCGCGAACGACACCGCTGGCCTGACGGCTGGTGCGCGCCCGCCAGAGTCCACCAAAACTCTGATTCAAACCTACAGTCTGAGTATATATTATATTAAAACTTTCAACAACGGATCTCTTGGTTCTGGCATCGATGAAGAACGCAGCGAAATGCGATAAGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACATTGCGCCCCTTGGTATTCCGAGGGGCATGCCTGTTCGAGCGTCATTTCACCACTCAAGCTCAGCTTGGTATTGGGTCATCGTCTGGTCACACAGGCGTGCCTGAAAATCAGTGGCGGTGCCCATCCGGCTTCAAGCATAGTAATTTCTATCTTGCTTTGGAAGTCTCCGGAGGGTTACACCGGCCAACAACCCCAATTTTCTATG
I have tried
$ cat Unite/sh_general_release_dynamic_02.02.2019.fasta \
Unite_61635/sh_general_release_dynamic_s_02.02.2019.fasta \
> mergeUnite/MergeUnite.temp.fasta
After merging the files, I used fastx_collapser to collapse the duplicate records. However, fastx_collapser discards the taxonomy information, and the headers become:
>1-234
ATCG........
The expected output should be:
>Symbiotaphrina_buchneri|DQ248313|SH1641879.08FU|reps|k__Fungi;p__Ascomycota;c__Xylonomycetes;o__Symbiotaphrinales;f__Symbiotaphrinaceae;g__Symbiotaphrina;s__Symbiotaphrina_buchneri
ACGATTTTGACCCTTCGGGGTCGATCTCCAACCCTTTGTCTACCTTCCTTGTTGCTTTGGCGGGCCGATGTTCGTTCTCGCGAACGACACCGCTGGCCTGACGGCTGGTGCGCGCCCGCCAGAGTCCACCAAAACTCTGATTCAAACCTACAGTCTGAGTATATATTATATTAAAACTTTCAACAACGGATCTCTTGGTTCTGGCATCGATGAAGAACGCAGCGAAATGCGATAAGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACATTGCGCCCCTTGGTATTCCGAGGGGCATGCCTGTTCGAGCGTCATTTCACCACTCAAGCTCAGCTTGGTATTGGGTCATCGTCTGGTCACACAGGCGTGCCTGAAAATCAGTGGCGGTGCCCATCCGGCTTCAAGCATAGTAATTTCTATCTTGCTTTGGAAGTCTCCGGAGGGTTACACCGGCCAACAACCCCAATTTTCTATG
>Dactylonectria_anthuriicola|JF735302|SH1546329.08FU|refs|k__Fungi;p__Ascomycota;c__Sordariomycetes;o__Hypocreales;f__Nectriaceae;g__Dactylonectria;s__Dactylonectria_anthuriicola
CCGAGTTTTCAACTCCCAAACCCCTGTGAACATACCATTTTGTTGCCTCGGCGGTGCCTGTTCCGACAGCCCGCCAGAGGACCCCAAACCCAAATTTCCTTGAGTGAGTCTTCTGAGTAACCGATTAAATAAATCAAAACTTTCAACAACGGATCTCTTGGTTCTGGCATCGATGAAGAACGCAGCGAAATGCGATAAGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACATTGCGCCCGCCAGTATTCTGGCGGGCATGCCTGTTCGAGCGTCATTTCAACCCTCAAGCCCCCGGGCTTGGTGTTGGGGATCGGCGAGCCTCTGCGCCCGCCGTCCCCTAAATTGAGTGGCGGTCACGTTGTAACTTCCTCTGCGTAGTAGCACACTTAGCACTGGGAAACAGCGCGGCCACGCCGTAAAACCCCCAACTTTGAACG
>Ilyonectria_robusta|JF735264|SH1546327.08FU|refs|k__Fungi;p__Ascomycota;c__Sordariomycetes;o__Hypocreales;f__Nectriaceae;g__Ilyonectria;s__Ilyonectria_robusta
CCGAGTTTACAACTCCCAAACCCCTGTGAACATACCATATTGTTGCCTCGGCGGTGTCTGTTTCGGCAGCCCGCCAGAGGACCCAAACCCTAGATTACATTAAAGCATTTTCTGAGTCAATGATTAAATCAATCAAAACTTTCAACAACGGATCTCTTGGTTCTGGCATCGATGAAGAACGCAGCGAAATGCGATAAGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACATTGCGCCCGCCAGTATTCTGGCGGGCATGCCTGTCCGAGCGTCATTTCAACCCTCAAGCCCCCGGGCTTGGTGTTGGAGATCGGCGAGCCCCCCGGGGCGCGCCGTCTCCCAAATATAGTGGCGGTCCCGCTGTAGCTTCCTCTGCGTAGTAGCACACCTCGCACTGGGAAACAGCGTGGCCACGCCGTAAAACCCCCCACTTCTGAAAG
Is there another method to do this without losing taxonomy information?
Solution
The following awk commands remove duplicate records. I can see three ways to detect duplicates:
Sequence name identical:
The short version would be:
$ awk '/^>/{p=seen[$0]++}!p' file1.fasta file2.fasta file3.fasta ...
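As a minimal sketch of how the short version behaves, the following builds two tiny FASTA files in a temporary directory and runs the same awk program on them. The file names `a.fasta` and `b.fasta` and their contents are made up for illustration.

```shell
# Two small FASTA files with hypothetical names and contents.
tmpdir=$(mktemp -d)
printf '>seq1|x\nACGT\n>seq2|y\nTTGA\n' > "$tmpdir/a.fasta"
printf '>seq1|x\nACGT\n>seq3|z\nGGCC\n' > "$tmpdir/b.fasta"

# On a header line, p becomes the number of times that header was seen before
# (0 for the first occurrence). The pattern !p then prints the header and the
# sequence lines that follow it only while p is 0.
merged=$(awk '/^>/{p=seen[$0]++}!p' "$tmpdir/a.fasta" "$tmpdir/b.fasta")
printf '%s\n' "$merged"
rm -r "$tmpdir"
```

The second copy of `seq1|x` is dropped together with its sequence line, and every header that is kept stays intact.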
However, the following version is a bit clearer and lets any user quickly adapt it to their needs:
$ awk 'BEGIN{RS=">"; FS="\n"; ORS=""}
(FNR==1){next}
{ name=$1; seq=$0; gsub(/(^[^\n]*|)\n/,"",seq) }
!(seen[name]++){ print ">" $0 }' file1.fasta file2.fasta file3.fasta ...
Here we introduce the variable name, which holds the sequence name (the header line), and the variable seq, which holds the sequence itself. Multi-line sequences are joined into a single line in seq.
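To see what the record splitting produces, this sketch prints a name and an unwrapped sequence for each record of a small made-up file (`wrapped.fasta` is a hypothetical name). It assumes one substitution: the sequence is built by joining fields 2..NF instead of calling gsub(), which should give the same single-line string.

```shell
tmpdir=$(mktemp -d)
# The first record has its sequence wrapped over two lines.
printf '>seqA|demo\nACGT\nTTGA\n>seqB|demo\nGGCC\n' > "$tmpdir/wrapped.fasta"

# RS=">" makes every FASTA entry one awk record; FS="\n" makes every line a field.
# The first record of each file is the empty text before the first ">", so skip it.
summary=$(awk 'BEGIN{RS=">"; FS="\n"}
     (FNR==1){next}
     { name=$1; seq=""; for(i=2;i<=NF;i++) seq=seq $i }
     { print name "\t" seq }' "$tmpdir/wrapped.fasta")
printf '%s\n' "$summary"
rm -r "$tmpdir"
```

Each output line holds the header text, a tab, and the sequence with its line breaks removed.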
As said before, this is easy to adapt to other criteria for detecting duplicates. For example:
First part of sequence name identical:
$ awk 'BEGIN{RS=">"; FS="\n"; ORS=""}
(FNR==1){next}
{ name=$1; seq=$0; gsub(/(^[^\n]*|)\n/,"",seq) }
{ key=substr(name,1,index(name,"|")) }
!(seen[key]++){ print ">" $0 }' file1.fasta file2.fasta file3.fasta ...
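A small sketch of the first-field key on made-up data follows. The headers and file names are hypothetical, and the fallback to the whole header when no "|" is present is an added assumption, not part of the command above.

```shell
tmpdir=$(mktemp -d)
# Sp_one appears in both files with different annotations after the first "|".
printf '>Sp_one|AAA|reps\nACGT\n>Sp_two|BBB|refs\nTTGA\n' > "$tmpdir/a.fasta"
printf '>Sp_one|CCC|refs\nACGA\n' > "$tmpdir/b.fasta"

# key is the header text up to and including the first "|"
# (the whole header if it contains no "|").
dedup=$(awk 'BEGIN{RS=">"; FS="\n"; ORS=""}
     (FNR==1){next}
     { name=$1; n=index(name,"|"); key=(n ? substr(name,1,n) : name) }
     !(seen[key]++){ print ">" $0 }' "$tmpdir/a.fasta" "$tmpdir/b.fasta")
printf '%s\n' "$dedup"
rm -r "$tmpdir"
```

Only the first `Sp_one` record survives, even though its accession and sequence differ from the later one, so this criterion is only appropriate when the first field really identifies a record.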
Sequence identical:
$ awk 'BEGIN{RS=">"; FS="\n"; ORS=""}
(FNR==1){next}
{ name=$1; seq=$0; gsub(/(^[^\n]*|)\n/,"",seq) }
!(seen[seq]++){ print ">" $0 }' file1.fasta file2.fasta file3.fasta ...
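This is the case closest to what fastx_collapser does, so a short sketch on made-up data may help. The file names and headers are hypothetical, and seq is again built by joining fields 2..NF as an assumed alternative to the gsub() call.

```shell
tmpdir=$(mktemp -d)
printf '>first|tax_A\nACGTTTGA\n>second|tax_B\nGGCC\n' > "$tmpdir/a.fasta"
# Same bases as "first", but wrapped over two lines and under another header.
printf '>third|tax_C\nACGT\nTTGA\n' > "$tmpdir/b.fasta"

# The unwrapped sequence is the key, so line wrapping does not matter.
kept=$(awk 'BEGIN{RS=">"; FS="\n"; ORS=""}
     (FNR==1){next}
     { seq=""; for(i=2;i<=NF;i++) seq=seq $i }
     !(seen[seq]++){ print ">" $0 }' "$tmpdir/a.fasta" "$tmpdir/b.fasta")
printf '%s\n' "$kept"
rm -r "$tmpdir"
```

The record `third|tax_C` is dropped because its bases match `first|tax_A`, and the header of the first occurrence is kept unchanged. The comparison is case-sensitive, so `acgt` and `ACGT` would count as different sequences.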
Sequence name and sequence identical:
$ awk 'BEGIN{RS=">"; FS="\n"; ORS=""}
(FNR==1){next}
{ name=$1; seq=$0; gsub(/(^[^\n]*|)\n/,"",seq) }
!(seen[name,seq]++){ print ">" $0 }' file1.fasta file2.fasta file3.fasta ...
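A sketch of the combined key on made-up data: a record is dropped only when both its header and its bases have been seen together before. File names and records are hypothetical, and seq is built by joining fields 2..NF as an assumed alternative to gsub().

```shell
tmpdir=$(mktemp -d)
# Same name twice, but with different bases: both should be kept.
printf '>id1|tax\nACGT\n>id1|tax\nTTGA\n' > "$tmpdir/a.fasta"
# An exact repeat of the first record (wrapped), then a new name with old bases.
printf '>id1|tax\nAC\nGT\n>id2|tax\nACGT\n' > "$tmpdir/b.fasta"

# seen[name,seq] joins both values into a single array subscript.
both=$(awk 'BEGIN{RS=">"; FS="\n"; ORS=""}
     (FNR==1){next}
     { name=$1; seq=""; for(i=2;i<=NF;i++) seq=seq $i }
     !(seen[name,seq]++){ print ">" $0 }' "$tmpdir/a.fasta" "$tmpdir/b.fasta")
printf '%s\n' "$both"
rm -r "$tmpdir"
```

Three of the four records remain: only the wrapped repeat of `id1|tax` with `ACGT` is removed.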
Some parts could of course be cleaned up. You do not always need name to determine the duplicate (see "sequence identical"), and you do not always need seq (see "sequence name identical"), so the corresponding parts of the code can be removed. I kept it this way, without cleanup, to show the general method.
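As an illustration of such a cleanup, here is one possible trimmed form of the name-only variant, which drops both helper variables and writes the result to a file. The file names, including `merged.fasta`, and the sample records are hypothetical.

```shell
tmpdir=$(mktemp -d)
printf '>id1|tax\nACGT\n>id2|tax\nTTGA\n' > "$tmpdir/a.fasta"
printf '>id2|tax\nTTGA\n>id3|tax\nGGCC\n' > "$tmpdir/b.fasta"

# Name-only variant: $1 is the header line, so neither name nor seq is needed.
awk 'BEGIN{RS=">"; FS="\n"; ORS=""}
     (FNR==1){next}
     !(seen[$1]++){ print ">" $0 }' "$tmpdir/a.fasta" "$tmpdir/b.fasta" > "$tmpdir/merged.fasta"

# Count the records that were kept.
n_records=$(grep -c '^>' "$tmpdir/merged.fasta")
echo "$n_records records kept"
rm -r "$tmpdir"
```

Merging and deduplication happen in one step here, so no intermediate concatenated file is needed.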
Note: the above makes use of the technique from "Remove line if field is duplicate".