(BioPython) How do I stop MemoryError: Out of Memory exception?

Problem description

I have a program that takes a pair of very large multi-sequence files (more than 77,000 sequences, each averaging about 1000 bp) and calculates the alignment score between each pair of corresponding sequences, writing each score to an output file (which I will load into Excel later).

My code works for small multi-sequence files, but with my large master file it throws the following traceback after analyzing the 16th pair.

Traceback (most recent call last):
  File "C:\Users\Harry\Documents\cgigas\BioPython Programs\Score Create Program\scoreCreate", line 109, in <module>
    cycle(f,k,binLen)
  File "C:\Users\Harry\Documents\cgigas\BioPython Programs\Score Create Program\scoreCreate", line 85, in cycle
    a = pairwise2.align.localxx(currentSubject.seq, currentQuery.seq, score_only=True)
  File "C:\Python26\lib\site-packages\Bio\pairwise2.py", line 301, in __call__
    return _align(**keywds)
  File "C:\Python26\lib\site-packages\Bio\pairwise2.py", line 322, in _align
    score_only)
MemoryError: Out of memory

I have tried many things to work around this (as you may see from the code), all to no avail. I have tried splitting the large master file into smaller batches to feed into the score-calculating method. I have tried calling del on files once I am done with them. I have tried running it on Ubuntu 11.11 in an Oracle virtual machine (I typically work in 64-bit Windows 7). Am I being too ambitious, or is this computationally feasible in BioPython? Below is my code. I have no experience debugging memory problems, and memory is clearly the culprit here. Any assistance is greatly appreciated; I am becoming very frustrated with this problem.

Best, Harry

##Open reference file
##a.)Upload subjectList
##b.)Upload query list (a and b are pairwise data)
## Cycle through each paired FASTA and get alignment score of each(Large file)

from Bio import SeqIO
from Bio import pairwise2
import gc


##BATCH ITERATOR METHOD (not my code)
def batch_iterator(iterator, batch_size) :
    entry = True #Make sure we loop once
    while entry :
        batch = []
        while len(batch) < batch_size :
            try :
                entry = iterator.next()
            except StopIteration :
                entry = None
            if entry is None :
                #End of file
                break
            batch.append(entry)
        if batch :
            yield batch

def split(subject,query):
    ##Query Iterator and Batch Subject Iterator
    query_iterator = SeqIO.parse(query,"fasta")
    record_iter = SeqIO.parse(subject,"fasta")

    ##Writes both large file into many small files
    print "Splitting Subject File..."
    binLen=2
    for j, batch1 in enumerate(batch_iterator(record_iter, binLen)) :
        filename1="groupA_%i.fasta" % (j+1)
        handle1=open(filename1, "w")
        count1 = SeqIO.write(batch1, handle1, "fasta")
        handle1.close()

    print "Done splitting Subject file"
    print "Splitting Query File..."

    for k, batch2 in enumerate(batch_iterator(query_iterator,binLen)):
        filename2="groupB_%i.fasta" % (k+1)
        handle2=open(filename2, "w")
        count2 = SeqIO.write(batch2, handle2, "fasta")
        handle2.close()

    print "Done splitting both FASTA files"
    print " "
    return [k ,binLen]


##This file will hold the alignment scores as tab-delimited text
f = open("C:\\Users\\Harry\\Documents\\cgigas\\alignScore.txt", 'w')

def cycle(f,k,binLen):
    i=1
    m=1
    while  i<=k+1:
        ##Open the first small file
        subjectFile = open("C:\\Users\\Harry\\Documents\\cgigas\\BioPython Programs\\groupA_" + str(i)+".fasta", "rU")
        queryFile =open("C:\\Users\\Harry\\Documents\\cgigas\\BioPython Programs\\groupB_" + str(i)+".fasta", "rU")
        i=i+1
        j=0


        ##Make small file iterators
        smallQuery=SeqIO.parse(queryFile,"fasta")
        smallSubject=SeqIO.parse(subjectFile,"fasta")

        ##Cycles through both sets of FASTA files
        while j<binLen:
                j=j+1
                currentQuery=smallQuery.next()
                currentSubject=smallSubject.next()
                ##Verify every pair is correct
                print " "
                print "Pair: " +  str(m)
                print "Subject: "+ currentSubject.id
                print "Query: " + currentQuery.id
                gc.collect()
                a = pairwise2.align.localxx(currentSubject.seq, currentQuery.seq, score_only=True)
                gc.collect()
                currentQuery=None
                currentSubject=None
                score=str(a)
                a=None
                print "Score: " + score
                f.write(score + "\n")
                m=m+1

        smallQuery.close()
        smallSubject.close()
        subjectFile.close()
        queryFile.close()
        gc.collect()
        print "New file"
##MAIN PROGRAM
##Here is our paired list of FASTA files

##subject = open("C:\\Users\\Harry\\Documents\\cgigas\\subjectFASTA.fasta", "rU")
##query =open("C:\\Users\\Harry\\Documents\\cgigas\\queryFASTA.fasta", "rU")
##[k,binLen]=split(subject,query)
k=272
binLen=2
cycle(f,k,binLen)

P.S. Be kind; I am aware there are probably some goofy things in the code that I put in while trying to get around this problem.
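As a point of comparison for the memory question, here is a minimal sketch of the same per-pair scoring that reads both files in lockstep and keeps only two rows of the scoring table alive per pair. It is written for Python 3 and uses only the standard library, so it is not the code above and not how pairwise2 is implemented. The names read_fasta, local_score and score_pairs are hypothetical, and the scoring assumes localxx means match = 1, mismatch = 0, no gap penalties, as the Biopython documentation describes. Pure Python will likely be slow at this data size; the sketch is only meant to show that a score-only run does not need memory proportional to the product of the two sequence lengths.

```python
import io


def read_fasta(handle):
    """Yield (record_id, sequence) one record at a time (hypothetical helper)."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def local_score(a, b):
    """Score-only local alignment, assuming match = 1, mismatch = 0, free gaps.

    Only the previous and current rows of the table are kept, so memory
    grows with len(b) rather than len(a) * len(b).
    """
    previous = [0] * (len(b) + 1)
    best = 0
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            diagonal = previous[j - 1] + (1 if char_a == char_b else 0)
            cell = max(diagonal, previous[j], current[j - 1])
            current.append(cell)
            if cell > best:
                best = cell
        previous = current
    return best


def score_pairs(subject_handle, query_handle, out_handle):
    """Write one tab-delimited line per pair and return the number of pairs."""
    pairs = zip(read_fasta(subject_handle), read_fasta(query_handle))
    count = 0
    for (subject_id, subject_seq), (query_id, query_seq) in pairs:
        score = local_score(subject_seq, query_seq)
        out_handle.write(f"{subject_id}\t{query_id}\t{score}\n")
        count += 1
    return count


# Tiny in-memory stand-ins for the two large FASTA files.
subjects = io.StringIO(">s1\nACGT\n>s2\nAAAA\n")
queries = io.StringIO(">q1\nACCGT\n>q2\nTTTT\n")
report = io.StringIO()
pair_count = score_pairs(subjects, queries, report)
print(report.getvalue(), end="")
```

With real data, the StringIO objects would be replaced by open file handles; because zip advances both readers one record at a time, no intermediate groupA/groupB files are needed.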

Recommended answer

See also this very similar question on BioStars: http://www.biostars.org/post/show/45893/trying-to-get-around-memoryerror-out-of-memory-exception-in-biopython-program/

There I suggested trying existing tools for this kind of thing, e.g. EMBOSS needleall (http://emboss.open-bio.org/wiki/Appdoc:Needleall); you can parse the EMBOSS alignment output with Biopython.
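To give a rough idea of what collecting scores from an EMBOSS report could look like, here is a small standard-library sketch. It is a stand-in for the Biopython parser mentioned above, not a replacement for it. It assumes the report uses EMBOSS "pair"-style headers with "# 1:", "# 2:" and "# Score:" lines; the actual layout depends on the output format chosen when running the tool. SAMPLE_REPORT is made-up illustrative text and read_scores is a hypothetical helper.

```python
import io

# Made-up header block in the style of an EMBOSS "pair" report (assumption).
SAMPLE_REPORT = """\
#=======================================
#
# Aligned_sequences: 2
# 1: subject_1
# 2: query_1
# Gap_penalty: 10.0
# Extend_penalty: 0.5
#
# Length: 12
# Score: 42.0
#
#=======================================
"""


def read_scores(handle):
    """Yield (first_id, second_id, score) for each alignment header block."""
    first = second = None
    for line in handle:
        if line.startswith("# 1:"):
            first = line.split(":", 1)[1].strip()
        elif line.startswith("# 2:"):
            second = line.split(":", 1)[1].strip()
        elif line.startswith("# Score:"):
            yield first, second, float(line.split(":", 1)[1])
            first = second = None


scores = list(read_scores(io.StringIO(SAMPLE_REPORT)))
print(scores)
```

Because read_scores is a generator, a large report can be processed one header block at a time and each score written straight to the tab-delimited output file.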
