如何从Fasta文件中删除所有N个序列条目 [英] How to remove all-N sequence entries from fasta file(s)

查看：447 发布时间：2020/11/12 22:12:07 python perl awk

本文介绍了如何从Fasta文件中删除所有N个序列条目的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

我想删除所有核苷酸均为N的fasta文件条目，但不删除包含ACGT和N个核苷酸的条目.

I would like to remove entries of a fasta file where all nucleotides are N, but not entries which contain ACGT and N nucleotides.

来自输入文件内容的示例:

from an example of input file content:

#>seq_1
TGCTAGCTAGCTGATCGTGTCGATCG
CACCACANNNNNCACGTGTCG
#>seq2
NNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNN
#>seq3
catgcatcgacgatgctgacgatc
#>seq4
cacacaccNNNNttgtgca
#...

希望输出文件内容为:

#>seq_1
TGCTAGCTAGCTGATCGTGTCGATCG
CACCACANNNNNCACGTGTCG
#>seq3
catgcatcgacgatgctgacgatc
#>seq4
cacacaccNNNNttgtgca
#...

在使用awk，perl，python和其他工具时有任何建议吗? 谢谢！ FDS

Any suggestions in doing this with awk, perl, python, other? Thank you! FDS

推荐答案

使用python，使用BioPython模块:

With python, using the BioPython module:

import Bio

INPUT  = "bio_input.fas"
OUTPUT = "bio_output.fas"

def main():
    records = Bio.SeqIO.parse(INPUT, 'fasta')
    filtered = (rec for rec in records if any(ch != 'N' for ch in rec.seq))
    Bio.SeqIO.write(filtered, OUTPUT, 'fasta')

if __name__=="__main__":
    main()

但是请注意，FastA规范指出序列ID应该以>开头，而不是以#>开头！

however note that the FastA spec says sequence ids are supposed to start with >, not #>!

与之对抗

>seq_1
TGCTAGCTAGCTGATCGTGTCGATCG
CACCACANNNNNCACGTGTCG
>seq2
NNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNN
>seq3
catgcatcgacgatgctgacgatc
>seq4
cacacaccNNNNttgtgca

这产生

>seq_1
TGCTAGCTAGCTGATCGTGTCGATCGCACCACANNNNNCACGTGTCG
>seq3
catgcatcgacgatgctgacgatc
>seq4
cacacaccNNNNttgtgca

(请注意，默认换行长度为60个字符).

(Note, the default linewrap length is 60 chars).

这篇关于如何从Fasta文件中删除所有N个序列条目的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

如何从Fasta文件中删除所有N个序列条目 [英] How to remove all-N sequence entries from fasta file(s)

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录关闭

如何从Fasta文件中删除所有N个序列条目 [英] How to remove all-N sequence entries from fasta file(s)

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录 关闭

登录关闭