Memory Error Python Processing Large File Line by Line

Problem Description

I am trying to concatenate model output files. The model run was broken up into 5 parts, and each output file corresponds to one of those partial runs; due to the way the software writes to file, the labelling restarts from 0 in each of the output files. I wrote some code to:

1) concatenate all the output files together
2) edit the merged file to re-label all timesteps, starting at 0 and increasing by one increment at each step

The aim is that I can load this single file into my visualization software in one chunk, rather than opening 5 different windows.

So far my code throws a memory error due to the large files I am dealing with.

I have a few ideas of how I could try to get rid of it, but I'm not sure what will work and/or what might slow things down to a crawl.

The code so far:

import os
import time

start_time = time.time()

#create new txt file in same folder as python script

open("domain.txt","w").close()


"""create concatenated document of all tecplot output files"""
#look into file number 1

for folder in range(1,6,1): 
    folder = str(folder)
    for name in os.listdir(folder):
        if "domain" in name:
            with open(folder+'/'+name) as file_content_list:
                start = ""
                for line in file_content_list:
                    start = start + line# + '\n' 
                with open('domain.txt','a') as f:
                    f.write(start)
              #  print start

#identify file with "domain" in name
#extract contents
#append to the end of the new document with "domain" in folder level above
#once completed, add 1 to the file number previously searched and do again
#keep going until no more files with a higher number exist

""" replace the old timesteps with new timesteps """
#open folder named domain.txt
#Look for lines:
##ZONE T="0.000000000000e+00s", N=87715, E=173528, F=FEPOINT, ET=QUADRILATERAL
##STRANDID=1, SOLUTIONTIME=0.000000000000e+00
# if they are found edits them, otherwise copy the line without alteration

with open("domain.txt", "r") as combined_output:
    start = ""
    start_timestep = 0
    time_increment = 3.154e10
    for line in combined_output:
        if "ZONE" in line:
            start = start + 'ZONE T="' + str(start_timestep) + 's", N=87715, E=173528, F=FEPOINT, ET=QUADRILATERAL' + '\n'
        elif "STRANDID" in line:
            start = start + 'STRANDID=1, SOLUTIONTIME=' + str(start_timestep) + '\n'
            start_timestep = start_timestep + time_increment
        else:
            start = start + line

    with open('domain_final.txt','w') as f:
        f.write(start)

end_time = time.time()
print('runtime : ', end_time - start_time)

os.remove("domain.txt")

So far, I get the memory error at the concatenation stage.

To improve it, I could:

1) Try to do the corrections on the fly as I read each file, but since it's already failing to get through an entire file, I don't think that would make much of a difference other than to computing time.

2) Load the whole file into an array, wrap the checks in a function, and run that function on the array:

Something like this:

def do_correction(line):
    # note: start_timestep must be defined and advanced (after each STRANDID line)
    # outside this function
    if "ZONE" in line:
        return 'ZONE T="' + str(start_timestep) + 's", N=87715, E=173528, F=FEPOINT, ET=QUADRILATERAL' + '\n'
    elif "STRANDID" in line:
        return 'STRANDID=1, SOLUTIONTIME=' + str(start_timestep) + '\n'
    else:
        return line

3) Keep it as is, and ask Python to indicate when it is about to run out of memory and write to the file at that stage. Does anyone know if that is possible?
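One way to approximate idea 3 without a low-memory callback is to flush the buffered lines to disk every N lines, so the buffer never grows to the size of the whole file. A minimal sketch, assuming do_correction from idea 2 is in place (it would still need start_timestep to be tracked and advanced, e.g. as a global); the CHUNK size is an arbitrary choice, not something from the original question:

# Approximation of idea 3: flush the buffer every CHUNK lines instead of waiting
# for a low-memory signal (CHUNK is an arbitrary value, not from the original post).
CHUNK = 100000

with open("domain.txt") as src, open("domain_final.txt", "w") as dst:
    buffered = []
    for line in src:
        buffered.append(do_correction(line))  # do_correction from idea 2 above
        if len(buffered) >= CHUNK:
            dst.writelines(buffered)
            buffered = []
    dst.writelines(buffered)  # write out whatever is left in the buffer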

Thank you for your help.

Recommended Answer

It is not necessary to read the entire contents of each file into memory before writing to the output file. Large files will just consume, possibly all, available memory.

Simply read and write one line at a time. Also open the output file once only... and choose a name that will not be picked up and treated as an input file itself, otherwise you run the risk of concatenating the output file onto itself (not yet a problem, but could be if you also process files from the current directory) - if loading it doesn't already consume all memory.

import os.path

with open('output.txt', 'w') as outfile:
    for folder in range(1,6,1): 
        for name in os.listdir(str(folder)):
            if "domain" in name:
                with open(os.path.join(str(folder), name)) as file_content_list:
                    for line in file_content_list:
                        # perform corrections/modifications to line here
                        outfile.write(line)

Now you can process the data in a line-oriented manner - just modify it before writing to the output file.
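For completeness, here is a sketch of what that might look like with the timestep relabelling from the question folded into the same loop. This is an illustration of the approach, not the answerer's exact code; it reuses the ZONE/STRANDID replacement strings, the timestep increment, and the folder layout from the question:

import os.path

start_timestep = 0
time_increment = 3.154e10

with open('output.txt', 'w') as outfile:
    for folder in range(1, 6):
        for name in os.listdir(str(folder)):
            if "domain" not in name:
                continue
            with open(os.path.join(str(folder), name)) as infile:
                for line in infile:
                    # rewrite the zone header and solution time, copy everything else
                    if "ZONE" in line:
                        line = ('ZONE T="' + str(start_timestep) +
                                's", N=87715, E=173528, F=FEPOINT, ET=QUADRILATERAL\n')
                    elif "STRANDID" in line:
                        line = 'STRANDID=1, SOLUTIONTIME=' + str(start_timestep) + '\n'
                        start_timestep += time_increment
                    outfile.write(line)

Because each line is written out as soon as it is produced, peak memory use stays at roughly one line regardless of how large the combined output grows.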

