The fastest way to read input in Python

Question

I want to read a huge text file that contains a list of lists of integers. Right now I'm doing the following:

G = []
with open("test.txt", 'r') as f:
    for line in f:
        G.append(list(map(int,line.split())))

However, it takes about 17 seconds (measured with timeit). Is there any way to reduce this time? Maybe there is a way to avoid using map.

Recommended answer

numpy has the functions loadtxt and genfromtxt, but neither is particularly fast. One of the fastest text readers available in a widely distributed library is the read_csv function in pandas (http://pandas.pydata.org/). On my computer, reading 5 million lines containing two integers per line takes about 46 seconds with numpy.loadtxt, 26 seconds with numpy.genfromtxt, and a little over 1 second with pandas.read_csv.
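To try a comparison like this yourself, you need a file of the same shape. The answer does not say how its test file was produced, so the following is only a minimal sketch under assumptions: the helper name write_test_file, the file name junk_sample.dat, and the use of random integers are all hypothetical.

```python
import os
import random
import tempfile


def write_test_file(path, n_rows, n_cols=2, seed=0):
    """Write n_rows lines, each with n_cols space-separated random integers.

    Hypothetical helper: the original answer does not show how its
    test file was created.
    """
    rng = random.Random(seed)
    with open(path, "w") as f:
        for _ in range(n_rows):
            row = (str(rng.randint(0, 10**6)) for _ in range(n_cols))
            f.write(" ".join(row) + "\n")


# A small row count keeps this quick; the timings quoted above were for
# 5 million rows, so pass n_rows=5_000_000 to approximate that size.
path = os.path.join(tempfile.gettempdir(), "junk_sample.dat")
write_test_file(path, n_rows=1000)
```

Timings on a file generated this way will depend on the machine and on library versions, so they may not match the numbers quoted above.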

Here's the session showing the result. (This is on Linux, Ubuntu 12.04 64 bit. You can't see it here, but after each reading of the file, the disk cache was cleared by running sync; echo 3 > /proc/sys/vm/drop_caches in a separate shell.)

In [1]: import pandas as pd

In [2]: %timeit -n1 -r1 loadtxt('junk.dat')
1 loops, best of 1: 46.4 s per loop

In [3]: %timeit -n1 -r1 genfromtxt('junk.dat')
1 loops, best of 1: 26 s per loop

In [4]: %timeit -n1 -r1 pd.read_csv('junk.dat', sep=' ', header=None)
1 loops, best of 1: 1.12 s per loop
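The question builds a list of lists, while read_csv returns a DataFrame. Here is a minimal sketch of getting from one to the other. It assumes every line holds the same number of integers. The sample data and the file name sample.dat are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data: a tiny stand-in for the large file in the question.
sample = "1 2\n3 4\n5 6\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.dat")
    with open(path, "w") as f:
        f.write(sample)

    # Same arguments as the read_csv call in the session:
    # space-separated values, no header row.
    df = pd.read_csv(path, sep=" ", header=None)

# Convert the DataFrame to a list of lists of Python ints, like G in the question.
G = df.values.tolist()
print(G)
```

If the lines have different lengths, this approach probably needs adjusting, since read_csv expects rectangular data. Converting to nested Python lists also takes extra time on a large file. Keeping the data in the DataFrame, or in the NumPy array from df.values, may be the better choice when the later processing allows it.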
