Read a large zipped text file line by line in Python


Question

I am trying to use the zipfile module to read a file in an archive. The uncompressed file is ~3 GB and the compressed file is 200 MB. I don't want the contents held in memory, since I process the file line by line. So far I have noticed excessive memory use with the following code:

import zipfile
f = open(...)
z = zipfile.ZipFile(f)
# readlines() builds a list of every line, so the whole file ends up in memory
for line in z.open(...).readlines():
    print(line)

I did it in C# using SharpZipLib:

var fStream = File.OpenRead("...");
var unzipper = new ICSharpCode.SharpZipLib.Zip.ZipFile(fStream);
var dataStream =  unzipper.GetInputStream(0);

dataStream is the uncompressed stream. I can't seem to find a way to do this in Python. Any help would be appreciated.

Recommended answer

Python file objects are iterators that yield one line at a time, and the object returned by z.open() behaves the same way. file.readlines() reads every line and returns a list, which means the whole file has to be read into memory. The better approach (which should always be preferred over readlines()) is to loop over the object itself, e.g.:

import zipfile
with zipfile.ZipFile(...) as z:
    with z.open(...) as f:
        # iterating the file object reads one line at a time
        for line in f:
            print(line)  # in Python 3 each line is a bytes object

Note the use of the with statement: file objects are context managers, and with lets us write readable code that ensures files are closed when the block exits (even if an exception is raised). It should always be used when working with files.
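For readers on Python 3, here is a minimal self-contained sketch of the same idea. It is an illustration under stated assumptions, not part of the original answer: the helper name iter_zip_lines, the member name data.txt, and the UTF-8 encoding are all hypothetical, and the archive is built in memory only so the example can run on its own. Because z.open() returns a binary stream, the sketch wraps it in io.TextIOWrapper to decode lines lazily.

```python
import io
import zipfile


def iter_zip_lines(archive, member, encoding="utf-8"):
    """Yield decoded lines from one member of a ZIP archive, one at a time.

    Hypothetical helper: the name, parameters, and default encoding are
    assumptions made for this illustration.
    """
    with zipfile.ZipFile(archive) as z:
        with z.open(member) as raw:
            # z.open() gives a binary stream; TextIOWrapper decodes it lazily,
            # so only a small buffer is held in memory at any time.
            with io.TextIOWrapper(raw, encoding=encoding) as text:
                for line in text:
                    yield line.rstrip("\n")


# Build a tiny archive in memory so the sketch runs without any files on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as z:
    z.writestr("data.txt", "first\nsecond\nthird\n")
buffer.seek(0)

lines = list(iter_zip_lines(buffer, "data.txt"))
print(lines)
```

With a real archive you would pass its path instead of the in-memory buffer and consume the generator in a for loop rather than calling list(), since list() would again collect every line in memory.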
