Loading Matlab sparse matrix saved with -v7.3 (HDF5) into Python and operating on it
Question
I'm new to Python, coming from Matlab. I have a large sparse matrix saved in Matlab v7.3 (HDF5) format. So far I've found two ways of loading the file, using h5py and tables. However, operating on the matrix seems to be extremely slow with either. For example, in Matlab:
>> whos
Name Size Bytes Class Attributes
M 11337x133338 77124408 double sparse
>> tic, sum(M(:)); toc
Elapsed time is 0.086233 seconds.
Using tables:
t = time.time()
sum(f.root.M.data)
elapsed = time.time() - t
print elapsed
35.929461956
Using h5py:
t = time.time()
sum(f["M"]["data"])
elapsed = time.time() - t
print elapsed
(I gave up waiting ...)
Based on the comments from @bpgergo, I should add that I've tried converting the result loaded by h5py (f) into a numpy array or a scipy sparse array in the following two ways:
from scipy import sparse
A = sparse.csc_matrix((f["M"]["data"], f["M"]["ir"], f["M"]["jc"]))
or
data = numpy.asarray(f["M"]["data"])
ir = numpy.asarray(f["M"]["ir"])
jc = numpy.asarray(f["M"]["jc"])
A = sparse.coo_matrix(data, (ir, jc))
but both of these operations are extremely slow as well.
Is there something I'm missing here?
Recommended answer
Most of your problem is that you're using Python's builtin sum on what's effectively a memory-mapped array (i.e. it's on disk, not in memory).
First off, you're comparing the time it takes to read things from disk to the time it takes to read things in memory. If you want to compare against what you're doing in Matlab, load the array into memory first.
Secondly, Python's builtin sum is very inefficient for numpy arrays. (Or, rather, iterating through every item of a numpy array independently is very slow, which is what Python's builtin sum is doing.) Use numpy.sum(yourarray) or yourarray.sum() instead for numpy arrays.
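To make the second point concrete, here is a minimal sketch that times both approaches on an array that is already in memory. The array size and the variable names are arbitrary choices for illustration, not taken from the question, and the exact timings will vary by machine.

```python
import time

import numpy as np

# An in-memory array standing in for the loaded "data" values (size is arbitrary).
values = np.arange(1_000_000, dtype=np.float64)

# Built-in sum: iterates over the array one element at a time in Python.
start = time.perf_counter()
builtin_total = sum(values)
builtin_seconds = time.perf_counter() - start

# ndarray.sum: a single vectorized reduction inside numpy.
start = time.perf_counter()
numpy_total = values.sum()
numpy_seconds = time.perf_counter() - start

print("built-in sum: %.4f s" % builtin_seconds)
print("ndarray.sum:  %.4f s" % numpy_seconds)
```

Both calls return the same total; only the time spent differs. When the object is an on-disk dataset rather than an in-memory array, the element-by-element iteration of the built-in sum is likely to be slower still.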
For example:
(Using h5py, because I'm more familiar with it.)
import h5py
import numpy as np
f = h5py.File('yourfile.hdf', 'r')
dataset = f['/M/data']
# Load the entire array into memory, like you're doing for matlab...
data = np.empty(dataset.shape, dataset.dtype)
dataset.read_direct(data)
print(data.sum())  # Or, alternatively, np.sum(data)
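The same idea extends to the sparse matrix itself: read the three index arrays into memory first, then hand them to scipy. The sketch below is one possible way to do that, under a few assumptions. The helper name csc_from_matlab_group is hypothetical, the row count is passed in by hand (the three arrays alone do not determine it), and the tiny dictionary stands in for the HDF5 group so the example runs without the original file.

```python
import numpy as np
from scipy import sparse


def csc_from_matlab_group(group, n_rows):
    """Build a scipy CSC matrix from a mapping holding 'data', 'ir' and 'jc'.

    `group` can be an h5py group or any dict-like object whose values
    support `[()]` indexing (hypothetical helper, not part of any library).
    """
    # `[()]` reads a whole h5py dataset into memory in one call.
    data = np.asarray(group["data"][()])
    # Cast the indices to a signed integer type before handing them to scipy.
    ir = np.asarray(group["ir"][()], dtype=np.int64)  # row index of each value
    jc = np.asarray(group["jc"][()], dtype=np.int64)  # column pointers
    n_cols = jc.size - 1  # a column-pointer array has one entry per column plus one
    return sparse.csc_matrix((data, ir, jc), shape=(n_rows, n_cols))


# Stand-in for f["M"]: a 3x3 matrix with three stored values.
demo_group = {
    "data": np.array([1.0, 2.0, 3.0]),
    "ir": np.array([0, 2, 1]),
    "jc": np.array([0, 2, 2, 3]),
}

A = csc_from_matlab_group(demo_group, n_rows=3)
total = A.sum()
print(A.toarray())
print(total)
```

With the real file, the call would presumably look like csc_from_matlab_group(f["M"], n_rows=11337), using the row count reported by whos in the question. Note that jc holds column pointers rather than per-value column indices, so it fits the CSC constructor directly but cannot be passed as the column coordinates of a COO matrix.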