Looping through big files takes hours in Python


Problem description

This is my second day working in Python. I worked on this in C++ for a while, but decided to try Python. My program works as expected. However, when I process one file at a time without the glob loop, it takes about half an hour per file. When I include the glob, the loop takes about 12 hours to process 8 files.

My question is this: is there anything in my program that is definitely slowing it down? Is there anything I should be doing to make it faster?

I have a folder of large files. For example:

file1.txt (6 GB), file2.txt (5.5 GB), file3.txt (6 GB)

If it helps, each line of data begins with a character that tells me how the rest of the characters are formatted, which is why I have all of the if/elif statements. A line of data looks like this: T35201 M352 RZNGA AC
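For instance, taking the sample line above, the first character selects the record type and fixed-position slices pull out the fields (the field meaning here is taken from the 'T' branch of the code below):

line = "T35201 M352 RZNGA AC"  # sample record from the question
line[0:1]   # 'T'     -- record type
line[1:6]   # '35201' -- the 'second' field in the 'T' branch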

I am trying to read each file, do some parsing using splits, and then save the file.

The computer has 32 GB of RAM, so my method is to read each file into RAM, loop through it, save the output, and then clear the RAM for the next file.

I've included the code so you can see the approach I am using. I use an if/elif chain with about 10 different elif branches. I have tried a dictionary, but I couldn't figure that out to save my life.

Any answers would be helpful.

import csv
import glob

for filename in glob.glob("/media/3tb/5may/*.txt"):
    f = open(filename, 'r')
    c = csv.writer(open(filename + '.csv', 'wb'))

    second = 0
    mill = 0
    for line in f.readlines():
        # print line
        event = 0
        ticker = 0
        marketCategory = 0
        variable = line[0:1]

        if variable is 'T':
            second = line[1:6]
            mill = 0
        else:
            second = second

        if variable is 'R':
            ticker = line[1:7]
            marketCategory = line[7:8]
        elif variable is ...
        elif variable is ...
        elif ...
        elif ...
        elif ...
        elif ...
        elif ...

        if variable != 'T' and variable != 'M':
            c.writerow([second, mill, event, ...])
    f.close()

UPDATE: Each of the elif statements is nearly identical. The only parts that change are the ways that I split the lines. Here are two of the elif statements (there are 13 in total, and they are almost all identical except for how the line is sliced):

elif variable is 'C':
    order = line[1:10]
    shares = line[10:16]
    match = line[16:25]
    printable = line[25:26]
    price = line[26:36]
elif variable is 'P':
    ticker = line[17:23]
    order = line[1:10]
    buy = line[10:11]
    shares = line[11:17]
    price = line[23:33]
    match = line[33:42]
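For reference, the dictionary approach mentioned above could be sketched roughly as follows, mapping each record-type character to its slice layout. This is only a sketch: the 'C' and 'P' layouts are taken from the two branches above, while the other record types, whose layouts the question does not show, would be added the same way.

# Map each leading record-type character to {field_name: (start, stop)} slices.
LAYOUTS = {
    'C': {'order': (1, 10), 'shares': (10, 16), 'match': (16, 25),
          'printable': (25, 26), 'price': (26, 36)},
    'P': {'order': (1, 10), 'buy': (10, 11), 'shares': (11, 17),
          'ticker': (17, 23), 'price': (23, 33), 'match': (33, 42)},
}

def parse_line(line):
    """Slice one record into named fields; returns None for unknown types."""
    layout = LAYOUTS.get(line[:1])  # line[:1] is safe on empty lines
    if layout is None:
        return None
    return {name: line[start:stop] for name, (start, stop) in layout.items()}

This replaces the whole elif chain with a single lookup, and adding a new record type becomes a one-entry change to the table rather than a new branch.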

UPDATE 2: I have run the code using for file in f two different times. The first time, I ran a single file without for filename in glob.glob("/media/3tb/file.txt"):, manually hard-coding the path for one file, and it took about 30 minutes.

I ran it again with for filename in glob.glob("/media/3tb/*file.txt") and it took an hour just for one file in the folder. Does the glob code add that much time?
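For reference, glob.glob builds the full list of matching paths once, up front, so by itself it should add only negligible overhead compared to hours of per-line work. A quick timing check along these lines (pattern taken from the update) can verify that:

import glob
import time

start = time.perf_counter()
paths = glob.glob("/media/3tb/*.txt")  # pattern from the question
elapsed = time.perf_counter() - start
print(len(paths), "files matched in", round(elapsed, 4), "seconds")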

Answer

Here:

for line in f.readlines():

You should do this:

for line in f:

The former reads the entire file into a list of lines and then iterates over that list. The latter reads the file incrementally, which should drastically reduce the total memory allocated and later freed by your program.
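Put together, the loop might be restructured roughly as follows. This is a sketch only: it uses the paths from the question, Python 3 file handling for the csv output (the question's 'wb' mode is Python 2 style), and elides the per-record parsing.

import csv
import glob

for filename in glob.glob("/media/3tb/5may/*.txt"):
    # 'with' closes both files even if an exception occurs; newline='' is
    # how a csv.writer output file should be opened in Python 3.
    with open(filename, 'r') as f, \
         open(filename + '.csv', 'w', newline='') as out:
        writer = csv.writer(out)
        second = 0
        mill = 0
        for line in f:          # streams one line at a time
            variable = line[:1] # leading character selects the layout
            # ... per-record parsing as in the question ...
            # writer.writerow([second, mill, ...])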
