Loading Matlab sparse matrix saved with -v7.3 (HDF5) into Python and operating on it


Question

I'm new to Python, coming from Matlab. I have a large sparse matrix saved in Matlab v7.3 (HDF5) format. So far I've found two ways of loading the file, using h5py and tables. However, operating on the matrix seems to be extremely slow with either one. For example, in Matlab:

>> whos     
  Name           Size                   Bytes  Class     Attributes

  M      11337x133338            77124408  double    sparse    

>> tic, sum(M(:)); toc
Elapsed time is 0.086233 seconds.

Using tables:

import time

t = time.time()
sum(f.root.M.data)
elapsed = time.time() - t
print(elapsed)
35.929461956

Using h5py:

t = time.time()
sum(f["M"]["data"])
elapsed = time.time() - t
print(elapsed)

(I gave up waiting ...)

Based on the comments from @bpgergo, I should add that I've tried converting the result loaded in by h5py (f) into a numpy array or a scipy sparse array in the following two ways:

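import numpy
from scipy import sparse

# Attempt 1: pass the h5py datasets straight to scipy
A = sparse.csc_matrix((f["M"]["data"], f["M"]["ir"], f["M"]["jc"]))

# Attempt 2: convert each dataset to a numpy array first
data = numpy.asarray(f["M"]["data"])
ir = numpy.asarray(f["M"]["ir"])
jc = numpy.asarray(f["M"]["jc"])
A = sparse.coo_matrix(data, (ir, jc))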

but both of these operations are extremely slow as well.

Is there something I'm missing here?

Recommended answer

Most of your problem is that you're using Python's builtin sum on what is effectively a memory-mapped array (i.e. it's on disk, not in memory).

First off, you're comparing the time it takes to read things from disk with the time it takes to read things already in memory. If you want a fair comparison with what you're doing in Matlab, load the array into memory first.

Secondly, Python's builtin sum is very inefficient for numpy arrays. (Or rather, iterating over every item of a numpy array individually is very slow, and that is what the builtin sum does.) For numpy arrays, use numpy.sum(yourarray) or yourarray.sum() instead.
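To illustrate the second point on its own, here is a minimal sketch that times both approaches on an array that is already in memory. The array is synthetic and its size is an arbitrary assumption, not the asker's data, so the absolute timings will differ from those in the question.

```python
import timeit

import numpy as np

# Synthetic stand-in for the nonzero values of M (size chosen arbitrarily).
rng = np.random.default_rng(0)
values = rng.random(200_000)

# Builtin sum iterates element by element, creating a Python object per item.
t_builtin = timeit.timeit(lambda: sum(values), number=3)

# ndarray.sum runs a single compiled loop over the underlying buffer.
t_numpy = timeit.timeit(lambda: values.sum(), number=3)

print("builtin sum:", t_builtin)
print("ndarray.sum:", t_numpy)
```

Both calls return the same total up to floating-point rounding; only the way the elements are traversed differs.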

For example:

(Using h5py, because I'm more familiar with it.)

import h5py
import numpy as np

f = h5py.File('yourfile.hdf', 'r')
dataset = f['/M/data']

# Load the entire array into memory, like you're doing for matlab...
data = np.empty(dataset.shape, dataset.dtype)
dataset.read_direct(data)

print(data.sum())  # or, equivalently, np.sum(data)
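The same idea extends to rebuilding the whole sparse matrix rather than just summing its values. The sketch below is one possible approach, not part of the original answer: it reads data, ir and jc fully into memory and hands them to scipy as a CSC matrix. It assumes the row count is stored in the group's MATLAB_sparse attribute, which is worth checking against your own file. The helper name load_matlab_sparse and the small demo file are made up for illustration.

```python
import os
import tempfile

import h5py
import numpy as np
from scipy import sparse


def load_matlab_sparse(group):
    """Build a scipy CSC matrix from an HDF5 group holding data/ir/jc."""
    # Indexing with [()] reads each dataset fully into a NumPy array.
    data = group["data"][()]
    ir = group["ir"][()].astype(np.int64)  # row index of each stored value
    jc = group["jc"][()].astype(np.int64)  # column pointers, n_cols + 1 long
    # Assumption: the number of rows lives in the MATLAB_sparse attribute.
    n_rows = int(group.attrs["MATLAB_sparse"])
    n_cols = len(jc) - 1
    return sparse.csc_matrix((data, ir, jc), shape=(n_rows, n_cols))


# Hypothetical 3x3 demo file laid out the same way, standing in for the
# real .mat file: [[1, 0, 2], [0, 0, 3], [4, 5, 0]] in column-major order.
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with h5py.File(path, "w") as f:
    g = f.create_group("M")
    g.create_dataset("data", data=np.array([1.0, 4.0, 5.0, 2.0, 3.0]))
    g.create_dataset("ir", data=np.array([0, 2, 2, 0, 1], dtype=np.uint64))
    g.create_dataset("jc", data=np.array([0, 2, 3, 5], dtype=np.uint64))
    g.attrs["MATLAB_sparse"] = np.uint64(3)

with h5py.File(path, "r") as f:
    A = load_matlab_sparse(f["M"])

print(A.shape)
print(A.sum())
```

Because the three arrays are copied into memory before scipy sees them, later operations on A no longer touch the file at all.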
