Python - How to gzip a large text file without MemoryError?


Question

I use the following simple Python script to compress a large text file (say, 10GB) on an EC2 m3.large instance. However, I always get a MemoryError:

import gzip

with open('test_large.csv', 'rb') as f_in:
    with gzip.open('test_out.csv.gz', 'wb') as f_out:
        f_out.writelines(f_in)
        # or the following:
        # for line in f_in:
        #     f_out.write(line)

The traceback I got is:

Traceback (most recent call last):
  File "test.py", line 8, in <module>
    f_out.writelines(f_in)
MemoryError

I have read some discussion about this issue, but still not quite clear how to handle this. Can someone give me a more understandable answer about how to deal with this problem?

Answer

That's odd. I would expect this error if you tried to compress a large binary file that didn't contain many newlines, since such a file could contain a "line" that was too big for your RAM, but it shouldn't happen on a line-structured .csv file.

But anyway, it's not very efficient to compress files line by line. Even though the OS buffers disk I/O, it's generally much faster to read and write larger blocks of data, e.g. 64 kB.

I have 2GB of RAM on this machine, and I just successfully used the program below to compress a 2.8GB tar archive.

#! /usr/bin/env python

import gzip
import sys

blocksize = 1 << 16     # 64 kB

def gzipfile(iname, oname, level):
    with open(iname, 'rb') as f_in:
        f_out = gzip.open(oname, 'wb', level)
        while True:
            block = f_in.read(blocksize)
            if block == '':
                break
            f_out.write(block)
        f_out.close()
    return


def main():
    if len(sys.argv) < 3:
        print "gzip compress in_file to out_file"
        print "Usage:\n%s in_file out_file [compression_level]" % sys.argv[0]
        exit(1)

    iname = sys.argv[1]
    oname = sys.argv[2]
    level = int(sys.argv[3]) if len(sys.argv) > 3 else 6

    gzipfile(iname, oname, level)


if __name__ == '__main__':  
    main()

I'm running Python 2.6.6 and gzip.open() doesn't support with.

As Andrew Bay notes in the comments, if block == '': won't work correctly in Python 3, since block contains bytes, not a string, and an empty bytes object doesn't compare as equal to an empty text string. We could check the block length, or compare to b'' (which will also work in Python 2.6+), but the simple way is if not block:.
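On Python 3, where gzip.open() does support with, the same block-copy idea can be written more concisely with shutil.copyfileobj, which streams fixed-size chunks between file objects. This is an illustrative rewrite of the program above, not part of the original answer:

```python
import gzip
import shutil

def gzipfile(iname, oname, level=6, blocksize=64 * 1024):
    # Stream the input to the gzip writer in fixed-size blocks,
    # so memory use stays constant regardless of file size.
    with open(iname, 'rb') as f_in, \
         gzip.open(oname, 'wb', compresslevel=level) as f_out:
        shutil.copyfileobj(f_in, f_out, blocksize)
```

copyfileobj performs the same read-loop internally (read a block, write it, stop on empty bytes), so the `if not block:` termination check is handled for you.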
