使用BioPython根据序列过滤FASTA文件 [英] Filtering a FASTA file based on sequence with BioPython

查看:286
本文介绍了使用BioPython根据序列过滤FASTA文件的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个fasta文件.从该文件中,我需要获取序列末尾和/或开头仅包含GTACAGTAGGCAACGGTTTTGCC的序列,并将它们放入新的fasta文件中.所以这是一个例子:

I have a fasta file. From that file, I need to get the only sequences containing GTACAGTAGG and CAACGGTTTTGCC at the end and/or start of the sequence and put them in a new fasta file. So here's an example:

>m121012_054644_42133_c100390582550000001523038311021245_s1_p0/7/2516_3269
***GTACAGTAGG***GTACACACAGAACGCGACAAGGCCAGGCGCTGGAGGAACTCCAGCAGCTAGATGCAAGCGACTA
TCAGAGCGTTGGGTCCAGAACGAAGAACAGTCACTCAAGACTGCTTT***CAACGGTTTTGCC***

>m121012_054644_42133_c100390582550000001523038311021245_s1_p0/7/3312_3597
CGCGGCATCGAATTAATACGACTCACTATAGGTTTTTTTATTGGTATTTTCAGTTAGATTCTTTCTTCTTAGAGGGTACA
GAGAAAGGGAGAAAATAGCTACAGACATGGGAGTGAAAGGTAGGAAGAAGAGCGAAGCAGACATTATTCA

>m121012_054644_42133_c100390582550000001523038311021245_s1_p0/7/3708_4657
***CAACGGTTTTGCC***ACAAGATCAGGAACATAAGTCACCAGACTCAATTCATCCCCATAAGACCTCGGACCTCTCA
ATCCTCGAATTAGGATGTTCTCCCCATGGCGTACGGTCTATCAGTATATAAACCTGACATACTATAAAAAAGTATACCAT
TCTTATCATGTACAGTAGG***GTACAGTAGG***

>m121012_054644_42133_c100390582550000001523038311021245_s1_p0/7/4704_5021
***GTACAGTAGG***GTGGGAGAGATGGCAGAAAGGCAGAAAGGAGAAAGATTCAGGATAACTCTCCTGGAGGGGCGAG
GTGCCATTCCCTGTGGTCACTTATTCTAAAGGCCCCAACCCTTCAAC***CAACGGTTTTGCC***

>m121012_054644_42133_c100390582550000001523038311021245_s1_p0/8/4223_4358
AAATATTGGGTCAAAGAACCGTTACTTTTCTTATATATGCGGCGCGAGGTTTTATATACTGATAAGAACCTACGCCATGG
GACATCTAATTCAGAGGGAAGAAGGTCCATGTCTGTTTGGATGAAATTGAGTCTG

(添加了*以突出显示)

我需要某种方法来获取序列末尾和/或序列开头仅包含GTACAGTAGG和CAACGGTTTTGCC的序列,并将其保存在新的fasta文件中.我对此很陌生.我什至不确定是否可以做到.预先感谢您提供的任何帮助.

I need some way to get the only sequences containing GTACAGTAGG and CAACGGTTTTGCC at the end and/or start of the sequences and get them out in a new fasta file. I'm very new to this. I'm not even sure if it can be done. Thanks in advance for any help you can give.

推荐答案

这是在Biopython中完成该任务的一种方法:

Here's one way to do it in Biopython:

from Bio import SeqIO

source = 'fasta_file_name.fa'
outfile = 'filtered.fa'
sub1 ='GTACAGTAGG'
sub2 = 'CAACGGTTTTGCC'

def seq_check(seq, sub1, sub2):
    # basically a function to check whether seq starts or ends with sub1 or sub2
    return seq.startswith(sub1) or seq.startswith(sub2) or \
        seq.endswith(sub1) or seq.endswith(sub2)

seqs = SeqIO.parse(source, 'fasta')
filtered = (seq for seq in seqs if seq_check(seq.seq, sub1, sub2))
SeqIO.write(filtered, outfile, 'fasta')

我们使用的是生成器表达式(以"filtered"开头的行),因此过滤是在程序读取源文件时完成的.这具有节省内存的优势.

We're using generator expression (the line that begins with 'filtered') so the filtering is done as the program's reading through the source file. This has the advantage of being memory-efficient.

我们还将创建一个新功能来检查序列的开始和结束,以使程序更具可读性.从理论上讲,您可以在生成器表达式中进行检查,但这会使行不必要地变长.

We're also creating a new function to check the starts and ends of the sequence to make the program more readable. In theory, you could do the check inside the generator expression, but that would make the line unecessarily long.

希望有帮助:).

这篇关于使用BioPython根据序列过滤FASTA文件的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆