Parsing a FASTA file using a generator (Python)
Question
I am trying to parse a large FASTA file and I am running into out-of-memory errors. Suggestions for improving the data handling would be appreciated. Currently the program prints the names correctly, but partway through the file I get a MemoryError.
Here is the generator:
def readFastaEntry( fp ):
    name = ""
    seq = ""
    for line in fp:
        if line.startswith( ">" ):
            tmp = []
            tmp.append( name )
            tmp.append( seq )
            name = line
            seq = ""
            yield tmp
        else:
            seq = seq.join( line )
And here is the caller stub; more will be added once this part works:
fp = open( sys.argv[1], 'r' )
for seq in readFastaEntry( fp ) :
    print seq[0]
For those not familiar with the FASTA format, here is an example:
>1 (PB2)
AATATATTCAATATGGAGAGAATAAAAGAACTAAGAGATCTAATGTCACAGTCTCGCACTCGCGAGATAC
TCACCAAAACCACTGTGGACCACATGGCCATAATCAAAAAGTACACATCAGGAAGGCAAGAGAAGAACCC
TGCACTCAGGATGAAGTGGATGATG
>2 (PB1)
AACCATTTGAATGGATGTCAATCCGACTTTACTTTTCTTGAAAGTTCCAGCGCAAAATGCCATAAGCACC
ACATTTCCCTATACTGGAGACCCTCC
Each entry starts with a ">" line stating the name etc.; the next N lines are the sequence data. There is no defined end of the data other than the next line beginning with ">".
Recommended answer
Have you considered using BioPython? It has a sequence reader that can read FASTA files. And if you are interested in coding one yourself, you can take a look at BioPython's code.
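As a rough sketch of the BioPython route, the snippet below uses `Bio.SeqIO.parse`, which iterates over records one at a time rather than loading the whole file. The function name `iter_fasta_records` is made up for illustration, and the example assumes the `biopython` package is installed; the import is placed inside the function only so the snippet can be loaded without it.

```python
def iter_fasta_records(path):
    """Yield (id, sequence length) for each record in a FASTA file.

    Hypothetical helper for illustration; assumes Biopython is installed.
    """
    # Imported lazily so this module can be loaded without Biopython.
    from Bio import SeqIO

    # SeqIO.parse returns an iterator of SeqRecord objects, so only one
    # record needs to be held in memory at a time.
    for record in SeqIO.parse(path, "fasta"):
        yield record.id, len(record.seq)
```

A caller would loop over it in the same way as over the hand-written generator, for example `for name, length in iter_fasta_records(sys.argv[1]): print(name, length)`.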
Edit: added code
def read_fasta(fp):
    name, seq = None, []
    for line in fp:
        line = line.rstrip()
        if line.startswith(">"):
            if name: yield (name, ''.join(seq))
            name, seq = line, []
        else:
            seq.append(line)
    if name: yield (name, ''.join(seq))

with open('f.fasta') as fp:
    for name, seq in read_fasta(fp):
        print(name, seq)
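A likely reason the generator in the question runs out of memory is the line `seq = seq.join( line )`. `str.join` treats the string it is called on as a separator, so the whole accumulated `seq` is inserted between every character of `line`, and the string grows multiplicatively with each data line. The small sketch below illustrates this on a toy input; the helper names `join_lengths` and `list_lengths` are invented for the comparison and are not part of either code listing above.

```python
def join_lengths(lines):
    """Track len(seq) after each step of `seq = seq.join(line)`."""
    seq = ""
    lengths = []
    for line in lines:
        # seq acts as the separator between the characters of line.
        seq = seq.join(line)
        lengths.append(len(seq))
    return lengths


def list_lengths(lines):
    """Track the total length when lines are collected in a list instead."""
    parts = []
    total = 0
    lengths = []
    for line in lines:
        parts.append(line)
        total += len(line)
        lengths.append(total)
    return lengths


lines = ["ACGT"] * 4
print(join_lengths(lines))  # [4, 16, 52, 160]
print(list_lengths(lines))  # [4, 8, 12, 16]
```

With 70-character lines the first approach reaches gigabytes after about five lines of one entry, whereas collecting the lines in a list and calling `''.join(seq)` once per entry, as `read_fasta` does, grows linearly.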