Memory-efficient way to iterate over part of a large file


Question

I normally avoid reading files like this:

with open(file) as f:
    list_of_lines = f.readlines()

and use code like this instead:

with open(file) as f:
    for line in f:
        pass  # do something with line

The exception is when I only have to iterate over a few lines in a file (and I know which lines those are); then I think it is easier to take slices of list_of_lines. Now this has come back to bite me. I have a huge file (reading it into memory is not possible), but I don't need to iterate over all of the lines, just a few of them. I already have code that finds where my first line is and how many lines after it I need to edit. I just don't have any idea how to write this loop.

n = ...      # number of lines to edit (found with grep)
start = ...  # index of the first line (found by pattern matching)
f = open('big_file')
# some loop over f from start to start + n
#     edit lines

Edit: my title may have led to a debate rather than an answer.

Recommended answer

If I understand your question correctly, the problem is that storing every line of the file in a list and then taking a slice uses too much memory. What you want is to read the file line by line while ignoring all but a certain range of lines (say, the half-open range [17, 34), i.e. zero-based lines 17 through 33).

Try using enumerate to keep track of which line number you're on as you iterate through the file. Here is a generator-based approach that uses yield to produce only the lines of interest, one at a time, and stops reading as soon as it passes the end of the range:

def read_only_lines(f, start, finish):
    for ii,line in enumerate(f):
        if ii>=start and ii<finish:
            yield line
        elif ii>=finish:
            return

f = open("big text file.txt", "r")
for line in read_only_lines(f, 17, 34):
    print(line, end="")  # each line already ends with a newline
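
To check the half-open behaviour of the generator, here is a minimal sketch that runs it on a small in-memory stand-in for the file. The io.StringIO object and its 100 numbered lines are assumptions made only for this demonstration; a real file object opened with open() is used the same way.

```python
import io

def read_only_lines(f, start, finish):
    """Yield lines whose zero-based index lies in [start, finish)."""
    for ii, line in enumerate(f):
        if start <= ii < finish:
            yield line
        elif ii >= finish:
            return  # stop reading once we are past the range

# Stand-in for a big file: 100 numbered lines held in memory (demo only).
fake_file = io.StringIO("".join(f"line {i}\n" for i in range(100)))

selected = list(read_only_lines(fake_file, 17, 34))
print(len(selected))          # 17
print(selected[0].strip())    # line 17
print(selected[-1].strip())   # line 33
```

Only the 17 selected lines are ever held in the list; the lines before the range are read and discarded one at a time.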

This read_only_lines function basically reimplements itertools.islice from the standard library, so you could use that for an even more compact implementation:

from itertools import islice

with open("big text file.txt", "r") as f:
    for line in islice(f, 17, 34):
        print(line, end="")
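
In terms of the question's variables, the range is start to start + n. The sketch below wraps that in a small helper; the name lines_in_window and the in-memory test data are hypothetical, chosen only for illustration.

```python
import io
from itertools import islice

def lines_in_window(f, start, n):
    """Return an iterator over n lines beginning at zero-based line `start`."""
    return islice(f, start, start + n)

# Stand-in for a big file (demo only).
fake_file = io.StringIO("".join(f"row {i}\n" for i in range(1000)))

window = [line.rstrip("\n") for line in lines_in_window(fake_file, 500, 3)]
print(window)  # ['row 500', 'row 501', 'row 502']

# islice counts from the file's current position, not from the top of the
# file, so the next read continues right after the window.
following = next(fake_file)
print(following.rstrip("\n"))  # row 503
```

Because islice counts from wherever the file object currently is, reopen the file or call f.seek(0) before slicing again if the indices are meant to be absolute line numbers.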

If you want to capture the lines of interest in a list rather than consume them lazily, just pass the islice object to list. Only the selected lines are held in memory:

from itertools import islice

with open("big text file.txt", "r") as f:
    lines_of_interest = list(islice(f, 17, 34))

do_something_awesome(lines_of_interest)
do_something_else(lines_of_interest)
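
The question also mentions editing the selected lines. The answer does not cover writing the changes back, so the following is only a sketch under the assumption that the edited result can go to a second file: it streams the source line by line and applies a change to the chosen range. The helper name edit_line_range, the file names, and the use of str.upper as the edit are all hypothetical.

```python
import os
import tempfile

def edit_line_range(src_path, dst_path, start, n, edit):
    """Copy src to dst, applying `edit` to zero-based lines start..start+n-1."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for i, line in enumerate(src):
            if start <= i < start + n:
                line = edit(line)
            dst.write(line)  # only one line is in memory at a time

# Demo on a small temporary file standing in for the big one.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "big_file")
dst_path = os.path.join(tmp_dir, "big_file.edited")
with open(src_path, "w") as fh:
    fh.writelines(f"line {i}\n" for i in range(10))

edit_line_range(src_path, dst_path, start=3, n=2, edit=str.upper)

with open(dst_path) as fh:
    result = fh.read().splitlines()
print(result[2:6])  # ['line 2', 'LINE 3', 'LINE 4', 'line 5']
```

Unlike islice, this version has to read the whole file, because the untouched lines after the range still need to be copied to the output.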
