计算scipy csr矩阵中的欧式距离 [英] Calculate the euclidean distance in scipy csr matrix

查看:187
本文介绍了计算scipy csr矩阵中的欧式距离的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要计算存储在csr稀疏矩阵中的所有点与一些点列表之间的欧几里得距离.对我来说,将csr转换为密集的csr会比较容易,但是由于内存不足,我无法这样做,因此我需要将其保留为csr.

I need to calculate the Euclidean Distance between all points that is stored in csr sparse matrix and some lists of points. It would be easier for me to convert the csr to a dense one, but I couldn't due to the lack of memory, so I need to keep it as csr.

因此,例如,我有这个 data_csr 稀疏矩阵(同时查看csr和稠密视图):

So for example I have this data_csr sparse matrix (view in both, csr and dense):

data_csr
(0, 2)  4
(1, 0)  1
(1, 4)  2
(2, 0)  2
(2, 3)  1
(3, 5)  1
(4, 0)  4
(4, 2)  3
(4, 3)  2

data_csr.todense()
[[0, 0, 4, 0, 0, 0]
 [1, 0, 0, 0, 2, 0]
 [2, 0, 0, 1, 0, 0]
 [0, 0, 0, 0, 0, 1]
 [4, 0, 3, 2, 0, 0]]

以及此中心点列表:

center
array([[0, 1, 2, 2, 4, 1],
      [3, 4, 1, 2, 4, 0]])

使用scipy.spatial包, data_csr center 之间的欧氏距离数组将类似于下面的数组.因此,相对于 data_csr 中的所有行,计算了 center 的每一行中总共6个点中的每个点.结果数组(2,5)的第一行是 center 的第一行与 data_csr 中的所有行之间的ED.

using the scipy.spatial package, the Euclidean Distance array between data_csr and center will be like the one below. So each point, of total 6 points, in each row of center was calculated against all rows in data_csr. The first row of the result array(2,5) is the ED between the first row of center and all rows in data_csr.

scipy.spatial.distance.cdist(center, data_csr, 'euclidean')

array([[ 5.09901951,  3.87298335,  5.19615242,  5.        ,  5.91607978],
      [ 7.34846923,  5.38516481,  5.91607978,  6.8556546 ,  6.08276253]])


到目前为止,我学到的东西可以使我得到非零值以及索引:


What I've learned so far that I can get the nonzero values as well the indices with:

data_csr.data
array([4, 1, 2, 2, 1, 1, 4, 3, 2])

data_csr.indices
array([2, 0, 4, 0, 3, 5, 0, 2, 3])

但是我仍然不知道如何计算这两个对象之间的ED.

But still I can't figure out how to calculate the ED between these two objects.

推荐答案

让我们创建矩阵(很遗憾,您没有提供我可以复制粘贴的输入)

So let's create your matrix (too bad you didn't give inputs that I could copy-n-paste)

In [114]: data=[4,1,2,2,1,1,4,3,2]   
In [115]: col=[0,1,1,2,2,3,4,4,4]
In [116]: row=[2,0,4,0,3,5,0,2,3]
In [117]: M=sparse.csr_matrix((data,(col,row)))

In [118]: M
Out[118]: 
<5x6 sparse matrix of type '<type 'numpy.int32'>'
    with 9 stored elements in Compressed Sparse Row format>

In [119]: M.A
Out[119]: 
array([[0, 0, 4, 0, 0, 0],
       [1, 0, 0, 0, 2, 0],
       [2, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 1],
       [4, 0, 3, 2, 0, 0]])

In [121]: center=np.array([[0,1,2,2,4,1],[3,4,1,2,4,0]])

那么您如何计算距离? M.A是(5,6),center是(2,6).这两个数组在做什么并不明显.

So how did you calculate the distance? M.A is (5,6), center is (2,6). It's not obvious what you are doing with these two arrays.

对于访问原始"稀疏值,coo格式最容易理解.这是我用来创建矩阵的相同行,列,数据内容

As for access to the 'raw' sparse values, the coo format is easiest to understand. It's the same row,col,data stuff I used to create the matrix

In [131]: M.tocoo().data
Out[131]: array([4, 1, 2, 2, 1, 1, 4, 3, 2])

In [132]: M.tocoo().col
Out[132]: array([2, 0, 4, 0, 3, 5, 0, 2, 3])

In [133]: M.tocoo().row
Out[133]: array([0, 1, 1, 2, 2, 3, 4, 4, 4])

csrdataindicesindptr数组中存储相同的信息.但是,您必须做一些数学运算才能从最后2个值中提取i,j值.csr乘法例程充分利用了这些数组.

The csr stores the same info in data, indices and indptr arrays. But you have to do some math to pull to the i,j values from the last 2. csr multiplication routines make good use of these arrays.

通常,与csr矩阵相乘比加法/减法更好.

In general it is better to do multiplication with csr matrices than addition/subtraction.

我等待进一步的澄清.

spatial.distance.cdist(center,M.A, 'euclidean')
Out[156]: 
array([[ 5.09901951,  3.87298335,  5.19615242,  5.        ,  5.91607978],
       [ 7.34846923,  5.38516481,  5.91607978,  6.8556546 ,  6.08276253]])

我们需要做的是研究此功能,并了解其输入.我们可能不得不超越其文档,再看一下代码.

What we need to do is study this function, and understand its inputs. We may have to go beyond its docs and look at the code.

但是看这段代码,我看到了确保xB是2d数组且列数与xA相同的步骤.然后对于euclidian它调用

But looking at this code I see steps to ensure that xB is 2d array, with the same number of columns as xA. Then for euclidian it calls

_distance_wrap.cdist_euclidean_wrap(_convert_to_double(XA),
                                    _convert_to_double(XB), dm)

看起来像一些C代码的包装.我无法想象有什么方法可以将它送入稀疏矩阵.

which looks like a wrapper on some C code. I can't imagine any way of feeding it a sparse matrix.

您可以遍历行;用M[[0],:].A调用distM.A[[0],:]相同-除了速度.遍历稀疏矩阵的行有点慢,这是因为它必须在每次迭代时构造一个新的稀疏矩阵. csrlil是行迭代最快的2.

You could iterate over rows; calling dist with M[[0],:].A is the same as M.A[[0],:] - except for speed. Iterating over rows of a sparse matrix is kind of slow, as much because it has to construct a new sparse matrix at each iteration. csr and lil are the 2 fastest for row iteration.

有些事情可能会更快-直接迭代lil格式的属性:

Here's something that might be faster - directly iterating on the attributes of the lil format:

 def foo(a,b,n):
    # make a dense array from data,row
    res = np.zeros((1,n))
    res[0,b]=a
    return res

In [190]: Ml=M.tolil()

In [191]: Ml.data
Out[191]: array([[4], [1, 2], [2, 1], [1], [4, 3, 2]], dtype=object)

In [192]: Ml.rows
Out[192]: array([[2], [0, 4], [0, 3], [5], [0, 2, 3]], dtype=object)

In [193]: rowgen=(foo(a,b,6) for a,b in zip(Ml.data,Ml.rows))

In [194]: np.concatenate([spatial.distance.cdist(center,row, 'euclidean') for row in rowgen],axis=1)
Out[194]: 
array([[ 5.09901951,  3.87298335,  5.19615242,  5.        ,  5.91607978],
       [ 7.34846923,  5.38516481,  5.91607978,  6.8556546 ,  6.08276253]])

现在我将跳过时间测试.

For now I'll skip the time tests.

这篇关于计算scipy csr矩阵中的欧式距离的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆