scipy cdist with sparse matrices

Question

I need to calculate the distances between two sets of vectors, source_matrix and target_matrix.

I have the following line, when both source_matrix and target_matrix are of type scipy.sparse.csr.csr_matrix:

distances = sp.spatial.distance.cdist(source_matrix, target_matrix)

And I end up getting the following partial exception traceback:

  File "/usr/local/lib/python2.7/site-packages/scipy/spatial/distance.py", line 2060, in cdist
    [XA] = _copy_arrays_if_base_present([_convert_to_double(XA)])
  File "/usr/local/lib/python2.7/site-packages/scipy/spatial/distance.py", line 146, in _convert_to_double
    X = X.astype(np.double)
ValueError: setting an array element with a sequence.

This seems to indicate that the sparse matrices are being treated as dense NumPy arrays, which both fails and defeats the purpose of using sparse matrices.

Any suggestions?

Recommended answer

I appreciate this post is quite old, but as one of the comments suggested, you could use the sklearn implementation which accepts sparse vectors and matrices.

Take two random vectors as an example:

import scipy.sparse
import sklearn.metrics.pairwise

a = scipy.sparse.rand(m=1, n=100, density=0.2, format='csr')
b = scipy.sparse.rand(m=1, n=100, density=0.2, format='csr')
sklearn.metrics.pairwise.pairwise_distances(X=a, Y=b, metric='euclidean')
# array([[ 3.14837228]])  # example output; the inputs are random, so values vary

Or even if a is a matrix and b is a vector:

a = scipy.sparse.rand(m=500, n=100, density=0.2, format='csr')
b = scipy.sparse.rand(m=1, n=100, density=0.2, format='csr')
sklearn.metrics.pairwise.pairwise_distances(X=a, Y=b, metric='euclidean')
# example output (one row per row of a):
# array([[ 2.9864606 ],
#        [ 3.33862248],
#        [ 3.45803465],
#        [ 3.15453179],
#        ...
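
If you want to convince yourself that the sklearn result matches what cdist would have returned, one way is to compare it against cdist on densified copies of small inputs. The following is only a sketch: the matrix sizes and the random_state seeds are my own choices for illustration, and .toarray() is only practical when the dense copies fit in memory.

```python
import numpy as np
import scipy.sparse
from scipy.spatial.distance import cdist
from sklearn.metrics.pairwise import pairwise_distances

# Small illustrative inputs (sizes and seeds are assumptions, not from the question)
a = scipy.sparse.random(50, 100, density=0.2, format='csr', random_state=0)
b = scipy.sparse.random(3, 100, density=0.2, format='csr', random_state=1)

# sklearn accepts the sparse inputs directly
sparse_result = pairwise_distances(X=a, Y=b, metric='euclidean')

# cdist needs dense arrays, so densify explicitly for the comparison
dense_result = cdist(a.toarray(), b.toarray(), metric='euclidean')

print(sparse_result.shape)                       # one row per row of a, one column per row of b
print(np.allclose(sparse_result, dense_result))  # the two approaches should agree numerically
```

Note that the returned distance matrix is itself a dense array with one entry per (row of a, row of b) pair, so only the inputs stay sparse.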

SciPy's spatial.distance module does not support sparse matrices, so sklearn is the better choice here. You can also pass the n_jobs argument to sklearn.metrics.pairwise.pairwise_distances, which splits the computation across parallel jobs when you have a large number of vectors.
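
As a rough illustration of the n_jobs option, here is a minimal sketch. The matrix sizes, the seeds, and the choice of n_jobs=2 are assumptions for the example; whether parallelism actually speeds things up depends on your data and machine.

```python
import numpy as np
import scipy.sparse
from sklearn.metrics.pairwise import pairwise_distances

# Illustrative sizes and seeds (assumptions, not taken from the question)
a = scipy.sparse.random(2000, 100, density=0.2, format='csr', random_state=0)
b = scipy.sparse.random(500, 100, density=0.2, format='csr', random_state=1)

# Default: computed in a single job
serial = pairwise_distances(X=a, Y=b, metric='euclidean')

# n_jobs=2 asks sklearn to split the work across two parallel jobs
parallel = pairwise_distances(X=a, Y=b, metric='euclidean', n_jobs=2)

print(serial.shape)
print(np.allclose(serial, parallel))  # both calls should give the same distances
```

Keep in mind that the output here is a dense 2000 x 500 array, so for very large inputs the size of the result, not the sparse inputs, is likely to be the memory bottleneck.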

Hope that helps.
