Process very large (>20GB) text file line by line

Question

I have a number of very large text files which I need to process, the largest being about 60GB.

Each line has 54 characters in seven fields and I want to remove the last three characters from each of the first three fields - which should reduce the file size by about 20%.

I am brand new to Python and have a code which will do what I want to do at about 3.4 GB per hour, but to be a worthwhile exercise I really need to be getting at least 10 GB/hr - is there any way to speed this up? This code doesn't come close to challenging my processor, so I am making an uneducated guess that it is limited by the read and write speed to the internal hard drive?

def ProcessLargeTextFile():
    r = open("filepath", "r")
    w = open("filepath", "w")
    l = r.readline()
    while l:
        x = l.split(' ')[0]
        y = l.split(' ')[1]
        z = l.split(' ')[2]
        w.write(l.replace(x,x[:-3]).replace(y,y[:-3]).replace(z,z[:-3]))
        l = r.readline()
    r.close()
    w.close()
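
If the guess about drive speed is right, a read-only pass over the file should take roughly as long as the full run. A quick way to check (a rough sketch; the path and block size are placeholders, not from the original post):

import time

def MeasureReadSpeed(path, blocksize=1024 * 1024):
    # Read the file in 1 MB blocks without processing it,
    # to estimate the raw throughput of the drive.
    total = 0
    start = time.time()
    with open(path, "rb") as f:
        while True:
            block = f.read(blocksize)
            if not block:
                break
            total += len(block)
    elapsed = time.time() - start
    print("Read %.0f MB in %.0f s (%.1f MB/s)" % (
        total / 1e6, elapsed, total / 1e6 / max(elapsed, 1e-9)))

# MeasureReadSpeed("filepath")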

Any help would be really appreciated. I am using the IDLE Python GUI on Windows 7 and have 16GB of memory - perhaps a different OS would be more efficient?

EDIT: Here is an extract of the file to be processed.

70700.642014 31207.277115 -0.054123 -1585 255 255 255
70512.301468 31227.990799 -0.255600 -1655 155 158 158
70515.727097 31223.828659 -0.066727 -1734 191 187 180
70566.756699 31217.065598 -0.205673 -1727 254 255 255
70566.695938 31218.030807 -0.047928 -1689 249 251 249
70536.117874 31227.837662 -0.033096 -1548 251 252 252
70536.773270 31212.970322 -0.115891 -1434 155 158 163
70533.530777 31215.270828 -0.154770 -1550 148 152 156
70533.555923 31215.341599 -0.138809 -1480 150 154 158

Answer

It's more idiomatic to write your code like this:

def ProcessLargeTextFile():
    with open("filepath", "r") as r, open("outfilepath", "w") as w:
        for line in r:
            x, y, z = line.split(' ')[:3]
            w.write(line.replace(x,x[:-3]).replace(y,y[:-3]).replace(z,z[:-3]))
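
Applied to the first line of the sample data, the trimming works out like this (a standalone check, not part of the original answer):

line = "70700.642014 31207.277115 -0.054123 -1585 255 255 255\n"
x, y, z = line.split(' ')[:3]
trimmed = line.replace(x, x[:-3]).replace(y, y[:-3]).replace(z, z[:-3])
# trimmed is now "70700.642 31207.277 -0.054 -1585 255 255 255\n" - 9 characters shorter per line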

The main saving here is to just do the split once, but if the CPU is not being taxed, this is likely to make very little difference.

It may help to save up a few thousand lines at a time and write them in one hit to reduce thrashing of your hard drive. A million lines is only 54 MB of RAM!

def ProcessLargeTextFile():
    bunchsize = 1000000     # Experiment with different sizes
    bunch = []
    with open("filepath", "r") as r, open("outfilepath", "w") as w:
        for line in r:
            x, y, z = line.split(' ')[:3]
            bunch.append(line.replace(x,x[:-3]).replace(y,y[:-3]).replace(z,z[:-3]))
            if len(bunch) == bunchsize:
                w.writelines(bunch)
                bunch = []
        w.writelines(bunch)    # write any lines left over in the final partial batch

As suggested by @Janne, here is an alternative way to generate the lines:

def ProcessLargeTextFile():
    bunchsize = 1000000     # Experiment with different sizes
    bunch = []
    with open("filepath", "r") as r, open("outfilepath", "w") as w:
        for line in r:
            x, y, z, rest = line.split(' ', 3)
            bunch.append(' '.join((x[:-3], y[:-3], z[:-3], rest)))
            if len(bunch) == bunchsize:
                w.writelines(bunch)
                bunch = []
        w.writelines(bunch)
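
A related tweak (not in the original answer) is to let Python do the batching for you by passing a larger buffer size to open(); the 1 MB figure here is just a starting point to experiment with, and the paths are placeholders as above:

def ProcessLargeTextFile():
    bufsize = 2**20    # 1 MB buffers - an arbitrary starting point, tune as needed
    # Larger buffers mean fewer, bigger reads and writes to the drive.
    with open("filepath", "r", bufsize) as r, open("outfilepath", "w", bufsize) as w:
        for line in r:
            x, y, z, rest = line.split(' ', 3)
            w.write(' '.join((x[:-3], y[:-3], z[:-3], rest)))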
