Python - Efficient Function with scipy sparse Matrices
Question
For a project, I need an efficient Python function that solves the following task:
Given a very large list X of long sparse vectors (i.e. a big sparse matrix) and another matrix Y that contains a single vector y, I want a list of "distances" from y to every element of X. The "distance" is defined like this:
Compare each element of the two vectors, always take the lower one, and sum them up.
Example:
X = [[0,0,2],
[1,0,0],
[3,1,0]]
Y = [[1,0,2]]
The function should return dist = [2,1,1]
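As a plain-Python reference for this definition, here is a minimal sketch on ordinary lists. The name overlap_distance is made up for illustration, and this is only meant to pin down the expected result, not to be fast:

```python
def overlap_distance(x, y):
    # Sum of the element-wise minimum of two equally long vectors.
    return sum(min(a, b) for a, b in zip(x, y))

X = [[0, 0, 2],
     [1, 0, 0],
     [3, 1, 0]]
y = [1, 0, 2]

# One distance per row of X.
dist = [overlap_distance(row, y) for row in X]
print(dist)
```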
In my project, both X and Y contain a lot of zeros and come in as an instance of:
<class 'scipy.sparse.csr.csr_matrix'>
So far so good: I managed to write a function that solves this task, but it is very slow and horribly inefficient. I need some tips on how to efficiently process/iterate over the sparse matrices. This is my function:
def get_distances(X, Y):
    ret = []
    rows, cols = X.shape
    for i in range(0, rows):
        dist = 0
        # Each row is converted to a dense matrix (this is the slow part).
        sample = X.getrow(i).todense()
        test = Y.getrow(0).todense()
        rows_s, cols_s = sample.shape
        rows_t, cols_t = test.shape
        # Walk over the columns and add up the smaller of the two entries.
        for s, t in zip(range(0, cols_s), range(0, cols_t)):
            dist += min(sample[0, s], test[0, t])
        ret.append(dist)
    return ret
To do my operations, I convert the sparse matrices to dense matrices, which is of course horrible, but I did not know how to do it better. Do you know how to improve my code and make the function faster?
Thank you very much!
Recommended answer
I revised your function and ran it in a script:
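import numpy as np
from scipy import sparse

def get_distances(X, Y):
    ret = []
    for row in X:
        # .toarray() is the explicit form of the .A shorthand
        sample = row.toarray()
        test = Y.getrow(0).toarray()
        # element-wise minimum of the two dense rows, then sum
        dist = np.minimum(sample[0, :], test[0, :]).sum()
        ret.append(dist)
    return ret

X = [[0,0,2],
     [1,0,0],
     [3,1,0]]
Y = [[1,0,2]]
XM = sparse.csr_matrix(X)
YM = sparse.csr_matrix(Y)
print(get_distances(XM, YM))
# same result without the Python loop, using fully dense arrays
print(np.minimum(XM.toarray(), YM.toarray()).sum(axis=1))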
This produces:
1255:~/mypy$ python3 stack37056258.py
[2, 1, 1]
[2 1 1]
np.minimum takes the element-wise minimum of two arrays (which may be 2d), so I don't need to iterate over columns. I also don't need to use indexing.
minimum is also implemented for sparse matrices, but I get a segmentation error when I try to apply it to your X (with 3 rows) and Y (with 1). If they are the same size, this works:
Ys = sparse.vstack((YM,YM,YM))
print(Ys.shape)
print (XM.minimum(Ys).sum(axis=1))
Converting the single-row matrix to an array also gets around the error, because it ends up using the dense version, np.minimum(XM.todense(), YM.A).
print (XM.minimum(YM.A).sum(axis=1))
When I try other element-by-element operations on these 2 matrices I get ValueError: inconsistent shapes, e.g. XM+YM or XM<YM. It looks like sparse does not implement broadcasting the way numpy arrays do.
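Since the sparse matrices do not broadcast, one possible way to avoid both the Python loop and a dense copy of X is to work on the CSR arrays directly. The following is only a sketch under an assumption that is not stated in the question: all stored values are non-negative, so the minimum against an implicit zero is always zero and only the stored entries of X can contribute. The function name get_distances_csr is made up for illustration; only the single row Y is densified.

```python
import numpy as np
from scipy import sparse

def get_distances_csr(X, Y):
    # Assumption: every value in X and Y is >= 0, so positions where X has
    # no stored entry contribute min(0, y) = 0 and can be skipped.
    X = sparse.csr_matrix(X)
    y = np.asarray(Y.toarray()).ravel()        # densify the single row only
    # Minimum of each stored X entry and the y value in the same column.
    mins = np.minimum(X.data, y[X.indices])
    # Row number of every stored entry, recovered from the CSR row pointer.
    row_ids = np.repeat(np.arange(X.shape[0]), np.diff(X.indptr))
    # Add the minima up per row; rows without stored entries get 0.
    return np.bincount(row_ids, weights=mins, minlength=X.shape[0])

XM = sparse.csr_matrix([[0, 0, 2],
                        [1, 0, 0],
                        [3, 1, 0]])
YM = sparse.csr_matrix([[1, 0, 2]])
dist = get_distances_csr(XM, YM)
print(dist)
```

np.bincount with weights returns floats, so the result may need a cast if integer distances are required. If negative values can occur, this shortcut is not valid and a same-shape sparse minimum would be the safer route.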
======================
Comparison of ways of replicating a 1-row sparse matrix many times
In [271]: A=sparse.csr_matrix([0,1,0,0,1])
In [272]: timeit sparse.vstack([A]*3000).A
10 loops, best of 3: 32.3 ms per loop
In [273]: timeit sparse.kron(A,np.ones((3000,1),int)).A
1000 loops, best of 3: 1.27 ms per loop
In this timing, kron is much faster than vstack (1.27 ms versus 32.3 ms per loop).
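To connect the replication step back to the distance computation, here is a small sketch that builds the replicated Y both ways, checks that they agree, and then applies the same-shape sparse minimum. The variable names (n, Y_v, Y_k) are illustrative, and the example matrices are the small ones from the question, so it says nothing about timing.

```python
import numpy as np
from scipy import sparse

XM = sparse.csr_matrix([[0, 0, 2],
                        [1, 0, 0],
                        [3, 1, 0]])
YM = sparse.csr_matrix([[1, 0, 2]])
n = XM.shape[0]

# Replicate the single row n times, once with vstack and once with kron.
Y_v = sparse.vstack([YM] * n).tocsr()
Y_k = sparse.kron(YM, np.ones((n, 1), dtype=int)).tocsr()

# Both approaches should give the same (n x cols) matrix.
same = np.array_equal(Y_v.toarray(), Y_k.toarray())

# With matching shapes the sparse minimum works; sum(axis=1) returns an
# (n x 1) matrix, which is flattened to a 1d array here.
dist = np.asarray(XM.minimum(Y_k).sum(axis=1)).ravel()
print(same, dist)
```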