The fastest way to read input in Python
Question
I want to read a huge text file that contains a list of lists of integers. Right now I'm doing the following:
G = []
with open("test.txt", 'r') as f:
    for line in f:
        G.append(list(map(int, line.split())))
However, it takes about 17 seconds (measured via timeit). Is there any way to reduce this time? Maybe there is a way to avoid using map.
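For reference, the usual way to drop map is a list comprehension, though in practice the two are about the same speed. A minimal sketch (the tiny sample file written here is a stand-in for the real test.txt, which is assumed to hold space-separated integers):

```python
# A tiny sample file stands in for the real test.txt (space-separated integers).
with open("test.txt", "w") as f:
    f.write("1 2 3\n4 5 6\n")

# List comprehension instead of map(int, ...).
G = []
with open("test.txt", "r") as f:
    for line in f:
        G.append([int(x) for x in line.split()])

print(G)  # [[1, 2, 3], [4, 5, 6]]
```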
Answer
numpy has the functions loadtxt and genfromtxt, but neither is particularly fast. One of the fastest text readers available in a widely distributed library is the read_csv function in pandas (http://pandas.pydata.org/). On my computer, reading 5 million lines containing two integers per line takes about 46 seconds with numpy.loadtxt, 26 seconds with numpy.genfromtxt, and a little over 1 second with pandas.read_csv.
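Applied to the question's format, a sketch of the read_csv approach (the tiny sample file is an assumption standing in for the real data, which is just space-separated integers):

```python
import pandas as pd

# A tiny sample file stands in for the real data (space-separated integers).
with open("test.txt", "w") as f:
    f.write("1 2 3\n4 5 6\n")

# sep=' ' splits columns on spaces; header=None because the file has no header row.
df = pd.read_csv("test.txt", sep=' ', header=None)

# Convert back to the list-of-lists form built in the question.
G = df.values.tolist()
print(G)  # [[1, 2, 3], [4, 5, 6]]
```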
Here's the session showing the result. (This is on Linux, Ubuntu 12.04 64-bit. You can't see it here, but after each read of the file, the disk cache was cleared by running sync; echo 3 > /proc/sys/vm/drop_caches in a separate shell.)
In [1]: import pandas as pd
In [2]: %timeit -n1 -r1 loadtxt('junk.dat')
1 loops, best of 1: 46.4 s per loop
In [3]: %timeit -n1 -r1 genfromtxt('junk.dat')
1 loops, best of 1: 26 s per loop
In [4]: %timeit -n1 -r1 pd.read_csv('junk.dat', sep=' ', header=None)
1 loops, best of 1: 1.12 s per loop
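To reproduce timings like these, a file of the benchmarked shape has to be generated first. A sketch that writes a junk.dat like the one in the session, but with only 1,000 lines instead of 5 million so it runs quickly:

```python
import random

# Write lines of two random integers each, the shape of the benchmarked junk.dat
# (only 1,000 lines here; the session above used 5 million).
with open("junk.dat", "w") as f:
    for _ in range(1000):
        f.write(f"{random.randrange(1_000_000)} {random.randrange(1_000_000)}\n")

# Sanity check: every line parses as two integers.
with open("junk.dat") as f:
    rows = [list(map(int, line.split())) for line in f]
print(len(rows))  # 1000
```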