How to read a file in reverse order in Python 3.2 without reading the whole file into memory?


Problem description

I am parsing log files of 1 to 10 GB using Python 3.2. I need to search for lines matching a specific regex (some kind of timestamp), and I want to find the last occurrence.

I have tried to use:

for line in reversed(list(open("filename")))

which resulted in very bad performance (in the good cases) and MemoryError in the bad cases.

In the thread "Read a file in reverse order using python" I did not find any good answer.

I found the following solution, "python head, tail and backward read by lines of a text file", which looks very promising. However, it does not work on Python 3.2 and fails with the error:

NameError: name 'file' is not defined

I later tried to replace File(file) with File(TextIOWrapper), as this is the type of object the builtin function open() returns. However, that resulted in several more errors (I can elaborate if someone suggests this is the right way :))

Solution

This is a function that does what you're looking for

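def reverse_lines(filename, BUFSIZE=4096):
    # Binary mode keeps byte offsets exact, so seeking backwards is safe.
    # Lines are yielded as bytes; decode them if you need str.
    with open(filename, "rb") as f:
        f.seek(0, 2)          # jump to the end of the file
        p = f.tell()          # offset of the first byte already consumed
        remainder = b""       # partial line carried over to the next chunk
        while True:
            sz = min(BUFSIZE, p)
            p -= sz
            f.seek(p)
            buf = f.read(sz) + remainder
            if b"\n" not in buf:
                # No complete line yet: keep accumulating.
                remainder = buf
            else:
                # Everything after the first newline is made of complete lines.
                i = buf.index(b"\n")
                for L in buf[i+1:].split(b"\n")[::-1]:
                    yield L
                remainder = buf[:i]
            if p == 0:
                break
        # What is left is the first line of the file.
        yield remainder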

It works by reading a buffer from the end of the file (4 KB by default) and yielding the lines in it in reverse order. It then moves back by another 4 KB and repeats until it reaches the beginning of the file. The code may need to hold more than 4 KB in memory when the section being processed contains no line feed (very long lines).

You can use the code like this:

for L in reverse_lines("my_big_file"):
   ... process L ...
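
Since the goal in the question is to find the last line matching a timestamp regex, here is a minimal sketch of how such a generator could be combined with the re module. It is not part of the original answer: the helper name find_last_match, the UTF-8 encoding and the timestamp pattern in the usage note are assumptions made for illustration, and the generator is a slightly restructured copy included only so that the snippet runs on its own.

```python
import re


def reverse_lines(filename, bufsize=4096):
    """Yield the lines of a file as bytes, last line first."""
    with open(filename, "rb") as f:
        f.seek(0, 2)                  # move to the end of the file
        pos = f.tell()
        remainder = b""               # partial line carried to the next chunk
        while pos > 0:
            size = min(bufsize, pos)
            pos -= size
            f.seek(pos)
            buf = f.read(size) + remainder
            first_nl = buf.find(b"\n")
            if first_nl == -1:
                # No newline in this chunk: keep accumulating.
                remainder = buf
                continue
            # Everything after the first newline consists of complete lines.
            for line in reversed(buf[first_nl + 1:].split(b"\n")):
                yield line
            remainder = buf[:first_nl]
        yield remainder               # first line of the file


def find_last_match(filename, pattern, encoding="utf-8"):
    """Hypothetical helper: return the last line matching pattern, or None."""
    regex = re.compile(pattern)
    for raw in reverse_lines(filename):
        # The encoding is an assumption; undecodable bytes are replaced.
        line = raw.decode(encoding, errors="replace")
        if regex.search(line):
            return line               # stop at the first hit from the end
    return None
```

For example, find_last_match("my_big_file", r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}") would return the last line that starts with a timestamp of that (assumed) shape. Because the search stops at the first hit, only the tail of the file is read when the match is near the end.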
