scipy cdist with sparse matrices
Question
I need to calculate the distances between two sets of vectors, source_matrix and target_matrix.
I have the following line, where both source_matrix and target_matrix are of type scipy.sparse.csr.csr_matrix:
distances = sp.spatial.distance.cdist(source_matrix, target_matrix)
And I end up getting the following partial exception traceback:
File "/usr/local/lib/python2.7/site-packages/scipy/spatial/distance.py", line 2060, in cdist
[XA] = _copy_arrays_if_base_present([_convert_to_double(XA)])
File "/usr/local/lib/python2.7/site-packages/scipy/spatial/distance.py", line 146, in _convert_to_double
X = X.astype(np.double)
ValueError: setting an array element with a sequence.
This seems to indicate that the sparse matrices are being treated as dense numpy matrices, which both fails and misses the point of using sparse matrices.
Any suggestions?
Recommended answer
I appreciate this post is quite old, but as one of the comments suggested, you could use the sklearn implementation, which accepts sparse vectors and matrices.
Take two random vectors as an example:
import scipy.sparse
import sklearn.metrics

a = scipy.sparse.rand(m=1, n=100, density=0.2, format='csr')
b = scipy.sparse.rand(m=1, n=100, density=0.2, format='csr')
sklearn.metrics.pairwise.pairwise_distances(X=a, Y=b, metric='euclidean')
# array([[ 3.14837228]])  (example output; the inputs are random)
Or even if a is a matrix and b is a vector:
a = scipy.sparse.rand(m=500, n=100, density=0.2, format='csr')
b = scipy.sparse.rand(m=1, n=100, density=0.2, format='csr')
sklearn.metrics.pairwise.pairwise_distances(X=a, Y=b, metric='euclidean')
# array([[ 2.9864606 ],   (example output)
#        [ 3.33862248],
#        [ 3.45803465],
#        [ 3.15453179],
#        ...
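As a quick sanity check, the sparse result can be compared against scipy's cdist on densified copies of the same data. This is only a sketch: the matrix sizes, the seeds, and the use of scipy.sparse.random with random_state are assumptions made for illustration, and calling .toarray() is only feasible when the matrices are small enough to fit in memory as dense arrays.

```python
import numpy as np
import scipy.sparse
from scipy.spatial.distance import cdist
from sklearn.metrics import pairwise_distances

# Small, seeded sparse inputs (sizes and seeds are illustrative assumptions).
a = scipy.sparse.random(20, 100, density=0.2, format='csr', random_state=0)
b = scipy.sparse.random(5, 100, density=0.2, format='csr', random_state=1)

# Sparse path: sklearn accepts CSR matrices directly.
sparse_result = pairwise_distances(X=a, Y=b, metric='euclidean')

# Dense path: cdist needs ordinary numpy arrays, so densify first.
dense_result = cdist(a.toarray(), b.toarray(), metric='euclidean')

# The two results should agree up to floating-point error.
print(sparse_result.shape)
print(np.allclose(sparse_result, dense_result))
```

The result has one row per row of a and one column per row of b, the same layout cdist produces.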
Scipy spatial.distance does not support sparse matrices, so sklearn would be the best choice here. You can also pass the n_jobs argument to sklearn.metrics.pairwise.pairwise_distances, which distributes the computation across parallel jobs if your vectors are very large.
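A minimal sketch of the n_jobs option, applied to the names from the question. The shapes, seeds, and the choice of n_jobs=2 are assumptions for illustration; whether parallelism actually helps depends on the data size and the machine.

```python
import numpy as np
import scipy.sparse
from sklearn.metrics import pairwise_distances

# Stand-ins for the question's matrices (shapes and seeds are assumptions).
source_matrix = scipy.sparse.random(200, 100, density=0.2, format='csr', random_state=0)
target_matrix = scipy.sparse.random(50, 100, density=0.2, format='csr', random_state=1)

# Single-job computation for reference.
serial = pairwise_distances(source_matrix, target_matrix, metric='euclidean')

# Same computation split across two parallel jobs.
parallel = pairwise_distances(source_matrix, target_matrix,
                              metric='euclidean', n_jobs=2)

print(parallel.shape)
print(np.allclose(serial, parallel))
```

Note that the returned distance matrix is dense, so its size grows with the product of the row counts of the two inputs even when the inputs themselves are sparse.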
Hope that helps.