Numpy loading csv TOO slow compared to Matlab

Question

I posted this question because I was wondering whether I did something terribly wrong to get this result.

I have a medium-size csv file and I tried to use numpy to load it. For illustration, I made the file using python:

import timeit
import numpy as np

my_data = np.random.rand(1500000, 3)*10
np.savetxt('./test.csv', my_data, delimiter=',', fmt='%.2f')

And then, I tried two methods: numpy.genfromtxt, numpy.loadtxt

setup_stmt = 'import numpy as np'
stmt1 = """\
my_data = np.genfromtxt('./test.csv', delimiter=',')
"""
stmt2 = """\
my_data = np.loadtxt('./test.csv', delimiter=',')
"""

t1 = timeit.timeit(stmt=stmt1, setup=setup_stmt, number=3)
t2 = timeit.timeit(stmt=stmt2, setup=setup_stmt, number=3)

And the result shows that t1 = 32.159652940464184, t2 = 52.00093725634724.
However, when I tried the same thing in MATLAB:

tic
for i = 1:3
    my_data = dlmread('./test.csv');
end
toc

The result shows: Elapsed time is 3.196465 seconds.

I understand that there may be some differences in the loading speed, but:


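  1. This is far more than I expected;

  2. Isn't np.loadtxt supposed to be faster than np.genfromtxt?

  3. I haven't tried the Python csv module yet, because loading csv files is something I do very frequently and the code with the csv module is a bit verbose... But I'd be happy to try it if that's the only way. For now I'm more concerned about whether I'm doing something wrong.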

Any input would be appreciated. Thanks a lot in advance!

Recommended answer

Yeah, reading csv files into numpy is pretty slow. There's a lot of pure Python along the code path. These days, even when I'm using pure numpy I still use pandas for IO:

>>> import numpy as np, pandas as pd
>>> %time d = np.genfromtxt("./test.csv", delimiter=",")
CPU times: user 14.5 s, sys: 396 ms, total: 14.9 s
Wall time: 14.9 s
>>> %time d = np.loadtxt("./test.csv", delimiter=",")
CPU times: user 25.7 s, sys: 28 ms, total: 25.8 s
Wall time: 25.8 s
>>> %time d = pd.read_csv("./test.csv", delimiter=",").values
CPU times: user 740 ms, sys: 36 ms, total: 776 ms
Wall time: 780 ms
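One detail worth checking with the pandas route: test.csv has no header row, and read_csv treats the first line as column names unless told otherwise. The following is a minimal sketch, assuming a small stand-in file (the name test_small.csv and the 1000-row size are made up for illustration), that passes header=None so every row is kept as data:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Write a small headerless CSV in the same format as test.csv.
# (File name and size are illustrative, not from the original post.)
rng = np.random.default_rng(0)
expected = np.round(rng.random((1000, 3)) * 10, 2)
path = os.path.join(tempfile.mkdtemp(), "test_small.csv")
np.savetxt(path, expected, delimiter=",", fmt="%.2f")

# header=None keeps the first line as data instead of using it as column names.
# dtype=np.float64 makes the intended element type explicit.
frame = pd.read_csv(path, delimiter=",", header=None, dtype=np.float64)

# to_numpy() returns the underlying data as a plain 2-D ndarray.
loaded = frame.to_numpy()
print(loaded.shape, loaded.dtype)
```

Without header=None the resulting array would be one row short, which is easy to miss on a 1.5-million-row file.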

Alternatively, in a simple enough case like this one, you could use something like what Joe Kington wrote here:

>>> %time data = iter_loadtxt("test.csv")
CPU times: user 2.84 s, sys: 24 ms, total: 2.86 s
Wall time: 2.86 s
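The iter_loadtxt function itself is not reproduced above. As a rough idea of what such a helper might look like, here is a sketch of a generator-based reader built on np.fromiter; it is an assumption about the general approach, not the linked code, and it only handles purely numeric, rectangular files:

```python
import os
import tempfile

import numpy as np


def iter_loadtxt(filename, delimiter=",", skiprows=0, dtype=float):
    """Load a rectangular numeric text file into a 2-D array (illustrative sketch)."""
    ncols = 0

    def iter_values():
        # Yield every field of every row as one flat stream of numbers.
        nonlocal ncols
        with open(filename, "r") as infile:
            for _ in range(skiprows):
                next(infile)
            for line in infile:
                if not line.strip():
                    continue  # skip blank lines
                fields = line.rstrip().split(delimiter)
                ncols = len(fields)  # remember the row width for the reshape
                for field in fields:
                    yield dtype(field)

    # np.fromiter fills a 1-D array straight from the generator,
    # so no intermediate list of rows is built.
    flat = np.fromiter(iter_values(), dtype=dtype)
    if ncols == 0:
        return flat.reshape((0, 0))
    return flat.reshape((-1, ncols))


# Small demonstration on a temporary file (name and size are illustrative).
demo_expected = np.arange(12, dtype=float).reshape(4, 3)
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
np.savetxt(demo_path, demo_expected, delimiter=",", fmt="%.2f")
demo_loaded = iter_loadtxt(demo_path)
print(demo_loaded.shape)
```

The trade-off is that a reader like this does no handling of missing values, comments, or mixed column types, which is presumably where much of the extra time in genfromtxt and loadtxt goes.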

There's also Warren Weckesser's textreader library, in case pandas is too heavy a dependency:

>>> import textreader
>>> %time d = textreader.readrows("test.csv", float, ",")
readrows: numrows = 1500000
CPU times: user 1.3 s, sys: 40 ms, total: 1.34 s
Wall time: 1.34 s
