How to efficiently remove the first line of a large file?
Question
This question has already been asked here and here, but none of the solutions worked for me.
How do I remove the first line from a large file efficiently in Python 3?
I am writing a program which requires logging, and the log file has a configurable maximum size, which could be infinite. Therefore, I do not want to use readlines()
or similar methods as these would be memory intensive. Speed is not a huge concern, but if it can be done without rewriting the entire file, and without temporary files, that would be great.
The solution needs to be cross-platform.
Example log file:
[09:14:56 07/04/17] [INFO] foo
[23:45:01 07/04/17] [WARN] bar
[13:45:28 08/04/17] [INFO] foobar
... many thousands more lines
Output:
[23:45:01 07/04/17] [WARN] bar
[13:45:28 08/04/17] [INFO] foobar
... many thousands more lines
This code will run in a loop:
while os.path.getsize(LOGFILE) > MAXLOGSIZE:
    # remove first line of file
None of the following solutions both works and is memory-efficient:
Solution #1 - works but inefficient
with open('file.txt', 'r') as fin:
    data = fin.read().splitlines(True)
with open('file.txt', 'w') as fout:
    fout.writelines(data[1:])
Solution #2 - doesn't work, leaves file empty
import shutil
source_file = open('file.txt', 'r')
source_file.readline()
target_file = open('file.txt', 'w')
shutil.copyfileobj(source_file, target_file)
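The reason Solution #2 leaves the file empty is that open('file.txt', 'w') truncates the file the moment it runs, so source_file has nothing left to copy. A minimal sketch of an in-place fix using a single 'r+' handle (the function name is mine; note it still reads the remainder into memory, so it does not solve the memory concern):

```python
def drop_first_line(path):
    """Remove the first line of the file at `path`, in place."""
    with open(path, 'r+') as f:
        f.readline()      # skip the first line
        rest = f.read()   # remainder of the file (held in memory)
        f.seek(0)
        f.write(rest)
        f.truncate()      # cut off the leftover tail
```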
Solution #3 - works, efficient, but uses additional file:
with open("file.txt", 'r') as f:
    with open("new_file.txt", 'w') as f1:
        next(f)  # skip header line (f.next() is Python 2 only)
        for line in f:
            f1.write(line)
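Solution #3 leaves the trimmed output in new_file.txt rather than at the original path. A hedged sketch of the same approach with a final swap back over the original (the function name and the '.tmp' suffix are my own choices, not from the question):

```python
import os

def drop_first_line_via_tempfile(path):
    """Copy everything after the first line to a temp file, then swap it in."""
    tmp = path + '.tmp'
    with open(path, 'r') as f, open(tmp, 'w') as f1:
        next(f)  # skip the first line
        for line in f:
            f1.write(line)
    # overwrite the original; atomic where the OS supports it
    os.replace(tmp, path)
```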
Accepted answer
So, this approach is very hacky. It will work well if your line sizes are about the same, with a small standard deviation. The idea is to read some portion of your file into a buffer that is small enough to be memory efficient but large enough that writing from both ends will not mess things up (since the lines are roughly the same size with little variance, we can cross our fingers and pray that it will work). We basically keep track of where we are in the file and jump back and forth. I use a collections.deque as a buffer because it has favorable append performance at both ends, and we can take advantage of the FIFO nature of a queue:
from collections import deque

def efficient_dropfirst(f, dropfirst=1, buffersize=3):
    f.seek(0)
    buffer = deque()
    tail_pos = 0
    # these next two loops assume the file has many thousands of
    # lines so we can safely drop and buffer the first few...
    for _ in range(dropfirst):
        f.readline()
    for _ in range(buffersize):
        buffer.append(f.readline())
    line = f.readline()
    while line:
        buffer.append(line)
        head_pos = f.tell()
        f.seek(tail_pos)
        tail_pos += f.write(buffer.popleft())
        f.seek(head_pos)
        line = f.readline()
    f.seek(tail_pos)
    # finally, clear out the buffer:
    while buffer:
        f.write(buffer.popleft())
    f.truncate()
Now, let's try this out with a pretend file that behaves nicely:
>>> s = """1. the quick
... 2. brown fox
... 3. jumped over
... 4. the lazy
... 5. black dog.
... 6. Old McDonald's
... 7. Had a farm
... 8. Eeyi Eeeyi Oh
... 9. And on this farm they had a
... 10. duck
... 11. eeeieeeiOH
... """
And finally:
>>> import io
>>> with io.StringIO(s) as f: # we mock a file
... efficient_dropfirst(f)
... final = f.getvalue()
...
>>> print(final)
2. brown fox
3. jumped over
4. the lazy
5. black dog.
6. Old McDonald's
7. Had a farm
8. Eeyi Eeeyi Oh
9. And on this farm they had a
10. duck
11. eeeieeeiOH
This should work out OK if dropfirst < buffersize by a good bit of "slack". Since you only want to drop the first line, just keep dropfirst=1, and you can maybe make buffersize=100 or something just to be safe. It will be much more memory efficient than reading "many thousands of lines", and as long as no single line is bigger than the previous lines, you should be safe. But be warned, this is very rough around the edges.
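As an alternative sketch (not part of the accepted answer): if the uniform-line-length assumption worries you, the remaining contents can instead be shifted down in fixed-size chunks after skipping the first line. Memory use is bounded by the chunk size and nothing is assumed about line lengths; the function name and default chunk size are illustrative. It needs a seekable handle opened for both reading and writing (e.g. mode 'r+'):

```python
def drop_first_line_chunked(f, chunksize=4096):
    """Remove the first line by shifting the rest of the file down in chunks."""
    f.seek(0)
    f.readline()              # position just past the first line
    read_pos = f.tell()
    write_pos = 0
    while True:
        f.seek(read_pos)
        chunk = f.read(chunksize)
        if not chunk:
            break
        read_pos = f.tell()
        f.seek(write_pos)
        f.write(chunk)        # safe: the write position always trails the read position
        write_pos = f.tell()
    f.seek(write_pos)
    f.truncate()              # drop the leftover tail
```

For example, calling it on the mock file from the demo above would leave everything from line 2 onward, just like efficient_dropfirst, without any tuning of a buffer size.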