How to efficiently remove the first line of a large file?

Problem description

This question has already been asked here and here, but none of the solutions worked for me.

How do I remove the first line from a large file efficiently in Python 3?

I am writing a program which requires logging, and the log file has a configurable maximum size, which could be infinite. Therefore, I do not want to use readlines() or similar methods as these would be memory intensive. Speed is not a huge concern, but if it can be done without rewriting the entire file, and without temporary files, that would be great.

The solution needs to be cross-platform.

Example log file:

[09:14:56 07/04/17] [INFO] foo
[23:45:01 07/04/17] [WARN] bar
[13:45:28 08/04/17] [INFO] foobar
... many thousands more lines

Output:

[23:45:01 07/04/17] [WARN] bar
[13:45:28 08/04/17] [INFO] foobar
... many thousands more lines

This code will be run in a loop:

while os.path.getsize(LOGFILE) > MAXLOGSIZE:
    # remove first line of file

None of the following solutions both work and are memory-efficient:

Solution #1 - works but inefficient

with open('file.txt', 'r') as fin:
    data = fin.read().splitlines(True)
with open('file.txt', 'w') as fout:
    fout.writelines(data[1:])

Solution #2 - doesn't work, leaves file empty

import shutil

source_file = open('file.txt', 'r')
source_file.readline()
target_file = open('file.txt', 'w')

shutil.copyfileobj(source_file, target_file)
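Solution #2 leaves the file empty because `open(path, 'w')` truncates the file the moment it runs, so by the time `shutil.copyfileobj` gets to copy, there is nothing left to read. A minimal sketch demonstrating this (the path and file contents are illustrative, not from the question):

```python
import os
import tempfile

# Create a small demo file in a throwaway directory.
path = os.path.join(tempfile.mkdtemp(), 'file.txt')
with open(path, 'w') as f:
    f.write('first line\nsecond line\n')

source_file = open(path, 'r')
first = source_file.readline()    # still returns 'first line\n' here
target_file = open(path, 'w')     # <-- truncates the file to 0 bytes
size_after_truncate = os.path.getsize(path)  # 0: the data is already gone

source_file.close()
target_file.close()
```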

Solution #3 - works, efficient, but uses additional file:

with open("file.txt",'r') as f:
    with open("new_file.txt",'w') as f1:
        next(f) # skip header line (f.next() is Python 2 only)
        for line in f:
            f1.write(line)
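If the additional file is acceptable, the usual way to finish this pattern is to swap the new file back over the original with os.replace, which is atomic on both POSIX and Windows. A runnable sketch (the demo file contents and the replace step are additions, not part of the original snippet):

```python
import os

# Create a small demo log file (contents are illustrative).
with open("file.txt", "w") as f:
    f.write("header\nsecond\nthird\n")

with open("file.txt", "r") as f, open("new_file.txt", "w") as f1:
    next(f)                  # skip the first line (Python 3 spelling)
    for line in f:
        f1.write(line)

os.replace("new_file.txt", "file.txt")  # atomically swap in the new file

with open("file.txt") as f:
    remaining = f.read()     # "second\nthird\n"
```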

Recommended answer

So, this approach is very hacky. It will work well if your line-sizes are about the same size with a small standard deviation. The idea is to read some portion of your file into a buffer that is small enough to be memory efficient but large enough that writing from both ends will not mess things up (since the lines are roughly the same size with little variance, we can cross our fingers and pray that it will work). We basically keep track of where we are in the file and jump back and forth. I use a collections.deque as a buffer because it has favorable append performance from both ends, and we can take advantage of the FIFO nature of a queue:

from collections import deque
def efficient_dropfirst(f, dropfirst=1, buffersize=3):
    f.seek(0)
    buffer = deque()
    tail_pos = 0
    # these next two loops assume the file has many thousands of
    # lines so we can safely drop and buffer the first few...
    for _ in range(dropfirst):
        f.readline()
    for _ in range(buffersize):
        buffer.append(f.readline())
    line = f.readline()
    while line:
        buffer.append(line)
        head_pos = f.tell()
        f.seek(tail_pos)
        tail_pos += f.write(buffer.popleft())
        f.seek(head_pos)
        line = f.readline()
    f.seek(tail_pos)
    # finally, clear out the buffer:
    while buffer:
        f.write(buffer.popleft())
    f.truncate()

Now, let's try this out with a pretend file that behaves nicely:

>>> s = """1. the quick
... 2. brown fox
... 3. jumped over
... 4. the lazy
... 5. black dog.
... 6. Old McDonald's
... 7. Had a farm
... 8. Eeyi Eeeyi Oh
... 9. And on this farm they had a
... 10. duck
... 11. eeeieeeiOH
... """

And finally:

>>> import io
>>> with io.StringIO(s) as f: # we mock a file
...     efficient_dropfirst(f)
...     final = f.getvalue()
...
>>> print(final)
2. brown fox
3. jumped over
4. the lazy
5. black dog.
6. Old McDonald's
7. Had a farm
8. Eeyi Eeeyi Oh
9. And on this farm they had a
10. duck
11. eeeieeeiOH

This should work out OK if dropfirst < buffersize by a good bit of "slack". Since you only want to drop the first line, just keep dropfirst=1, and you can maybe make buffersize=100 or something just to be safe. It will be much more memory efficient than reading "many thousands of lines", and if no single line is bigger than the previous lines, you should be safe. But be warned, this is very rough around the edges.
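Wiring this into the question's size-capped loop might look like the sketch below (the function body is repeated from above so the sketch runs standalone; the log file name and size cap are hypothetical). Opening the file in binary mode keeps seek()/tell() as plain byte offsets, which is the safest way to do the in-place jumping on a real file:

```python
from collections import deque
import os

def efficient_dropfirst(f, dropfirst=1, buffersize=3):
    """Same function as above, repeated so this sketch is self-contained."""
    f.seek(0)
    buffer = deque()
    tail_pos = 0
    for _ in range(dropfirst):
        f.readline()
    for _ in range(buffersize):
        buffer.append(f.readline())
    line = f.readline()
    while line:
        buffer.append(line)
        head_pos = f.tell()
        f.seek(tail_pos)
        tail_pos += f.write(buffer.popleft())
        f.seek(head_pos)
        line = f.readline()
    f.seek(tail_pos)
    while buffer:
        f.write(buffer.popleft())
    f.truncate()

# Hypothetical log name and size cap, mirroring the question's loop.
LOGFILE, MAXLOGSIZE = "app.log", 64
with open(LOGFILE, "w") as f:
    for i in range(10):
        f.write(f"[00:00:0{i}] [INFO] entry {i}\n")  # 26 bytes per line

# Drop one line at a time until the file fits under the cap.
# Note the "rb+" mode: the function reads and writes the same handle.
while os.path.getsize(LOGFILE) > MAXLOGSIZE:
    with open(LOGFILE, "rb+") as f:
        efficient_dropfirst(f)
```

Because every demo line is 26 bytes, the loop above trims the 260-byte file down to the last two entries (52 bytes), the first size at or below the cap.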
