The fastest way to read input in Python
Question
I want to read a huge text file that contains a list of lists of integers. Currently I am doing the following:
G = []
with open("test.txt", 'r') as f:
    for line in f:
        G.append(list(map(int, line.split())))
However, this takes about 17 seconds (measured with timeit). Is there any way to reduce this time? Perhaps there is a way to avoid using map.
Recommended answer
numpy has the functions loadtxt and genfromtxt, but neither is particularly fast. One of the fastest text readers available in a widely distributed library is the read_csv function in pandas (http://pandas.pydata.org/). On my computer, reading 5 million lines containing two integers per line takes about 46 seconds with numpy.loadtxt, 26 seconds with numpy.genfromtxt, and a little over 1 second with pandas.read_csv.
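As a rough sketch of how read_csv could stand in for the loop in the question, the snippet below reads a whitespace-separated file and converts the result back to a list of lists. The helper name read_int_rows and the tiny temporary file are made up for illustration, and the approach assumes every line has the same number of integers; ragged rows would need different handling.

```python
import os
import tempfile

import pandas as pd


def read_int_rows(path):
    """Read whitespace-separated integers into a list of lists (hypothetical helper)."""
    # sep=r"\s+" accepts one or more spaces or tabs between values;
    # header=None tells pandas the first line is data, not column names.
    frame = pd.read_csv(path, sep=r"\s+", header=None)
    # Convert the 2-D array of values into nested Python lists.
    return frame.values.tolist()


# Write a small sample file so the sketch can be run as-is.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("1 2\n3 4\n5 6\n")
    sample_path = tmp.name

rows = read_int_rows(sample_path)
os.remove(sample_path)
print(rows)
```

If a NumPy array is acceptable downstream, skipping the final tolist() call and working with frame.values directly avoids building millions of small Python lists.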
Here's the session showing the result. (This is on Linux, Ubuntu 12.04 64 bit. You can't see it here, but after each reading of the file, the disk cache was cleared by running sync; echo 3 > /proc/sys/vm/drop_caches in a separate shell.)
In [1]: import pandas as pd
In [2]: %timeit -n1 -r1 loadtxt('junk.dat')
1 loops, best of 1: 46.4 s per loop
In [3]: %timeit -n1 -r1 genfromtxt('junk.dat')
1 loops, best of 1: 26 s per loop
In [4]: %timeit -n1 -r1 pd.read_csv('junk.dat', sep=' ', header=None)
1 loops, best of 1: 1.12 s per loop
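To try a comparison like this on your own machine, something along these lines could generate a file of the same shape (two integers per line) and time the plain-Python loop from the question as a baseline. The original junk.dat is not shown in the answer, so the file name, line count, value range, and the helper names write_sample and read_with_loop are all assumptions for illustration.

```python
import os
import random
import tempfile
import time


def write_sample(path, n_lines, seed=0):
    """Write n_lines lines, each holding two space-separated random integers."""
    rng = random.Random(seed)  # fixed seed so repeated runs produce the same file
    with open(path, "w") as f:
        for _ in range(n_lines):
            f.write(f"{rng.randrange(1000000)} {rng.randrange(1000000)}\n")


def read_with_loop(path):
    """Baseline reader: the same approach as the loop in the question."""
    with open(path) as f:
        return [list(map(int, line.split())) for line in f]


# 100,000 lines keeps the sketch quick; the answer used 5 million.
n_lines = 100_000
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "junk_sample.dat")
    write_sample(sample_path, n_lines)

    start = time.perf_counter()
    rows = read_with_loop(sample_path)
    elapsed = time.perf_counter() - start

print(f"read {len(rows)} lines in {elapsed:.3f} s")
```

Note that this does not clear the disk cache between runs as the answer did, so timings taken this way mostly reflect parsing cost rather than disk reads.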