Python fastest access to line in file

Question

I have an ASCII table in a file from which I want to read a particular set of lines (e.g. lines 4003 to 4005). The issue is that this file could be very long (e.g. hundreds of thousands to millions of lines), and I'd like to do this as quickly as possible.

Bad Solution: Read in the entire file, and go to those lines,

f = open('filename')
lines = f.readlines()[4003:4005]    # readlines() loads the entire file into memory first

Better Solution: enumerate over each line so that it's not all in memory (a la https://stackoverflow.com/a/2081880/230468)

f = open('filename')
lines = []
for i, line in enumerate(f):
    if 4003 <= i <= 4005:
        lines.append(line)
    if i > 4005:
        break                                             # @Wooble

Best Solution?

But this still requires going through each line. Is there a better (in terms of speed/efficiency) method of accessing a particular line? Should I use linecache even though I will only access the file once (typically)?

Using a binary file instead, in which case it might be easier to skip ahead, is an option, but I'd much rather avoid it.
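
A minimal sketch of that skip-ahead idea, assuming (hypothetically) that every line in the file has the same fixed byte width, in which case seek() can jump straight to the target lines without scanning anything before them:

RECORD_WIDTH = 80                                  # assumed fixed bytes per line, newline included
with open('afile', 'rb') as f:
    f.seek(4003 * RECORD_WIDTH)                    # jump directly to line 4003 (0-indexed)
    lines = [f.read(RECORD_WIDTH) for _ in range(2)]   # lines 4003 and 4004

This only works when the record width is truly constant; variable-length lines would need an index built in advance.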

Solution

I would probably just use itertools.islice. Using islice over an iterable like a file handle means the whole file is never read into memory, and the first 4002 lines are discarded as quickly as possible. You could even cast the two lines you need into a list pretty cheaply (assuming the lines themselves aren't very long). Then you can exit the with block, closing the file handle.

from itertools import islice

with open('afile') as f:
    lines = list(islice(f, 4003, 4005))    # lines at 0-based indices 4003 and 4004
do_something_with(lines)

Update

But holy cow is linecache faster for multiple accesses. I created a million-line file to compare islice and linecache, and linecache blew it away.

>>> timeit("x=islice(open('afile'), 4003, 4005); print next(x) + next(x)", 'from itertools import islice', number=1)
4003
4004

0.00028586387634277344
>>> timeit("print getline('afile', 4003) + getline('afile', 4004)", 'from linecache import getline', number=1)
4002
4003

2.193450927734375e-05

>>> timeit("getline('afile', 4003) + getline('afile', 4004)", 'from linecache import getline', number=10**5)
0.14125394821166992
>>> timeit("''.join(islice(open('afile'), 4003, 4005))", 'from itertools import islice', number=10**5)
14.732316970825195
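
For reference, the linecache pattern being timed above boils down to the following minimal sketch (the variable names are mine). Note that getline numbers lines from 1 and caches the entire file in memory after the first call, which is exactly why the repeated lookups are so cheap:

from linecache import getline

# The first call reads and caches the whole file; later calls hit the cache.
# Line numbers are 1-based, unlike the 0-based islice indices above.
line_a = getline('afile', 4003)
line_b = getline('afile', 4004)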

Constantly re-importing and re-reading the file:

This is not a practical test, but even re-importing linecache at each step, it's only a second slower than islice.

>>> timeit("from linecache import getline; getline('afile', 4003) + getline('afile', 4004)", number=10**5)
15.613967180252075

Conclusion

Yes, linecache is faster than islice in every case except when you're constantly re-creating the linecache, but who does that? For the likely scenarios (reading only a few lines once, or reading many lines once) linecache is faster and presents a terse syntax, but the islice syntax is quite clean and fast as well, and never reads the whole file into memory. In a RAM-tight environment, the islice solution may be the right choice. For very high speed requirements, linecache may be the better choice. Practically, though, in most environments both times are small enough that it almost doesn't matter.
