Computing a sparse pairwise distance matrix in R

Question

I have an NxM matrix and I want to compute the NxN matrix of Euclidean distances between the N points (each row is a point in M dimensions). In my problem, N is about 100,000. As I plan to use this matrix for a k-nearest neighbor algorithm, I only need to keep the k smallest distances for each point, so the resulting NxN matrix is very sparse. This is in contrast to what comes out of dist(), for example, which would result in a dense matrix (and probably storage problems for my size N).
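To put rough numbers on the storage concern, here is a back-of-the-envelope sketch. It assumes 8-byte doubles and 4-byte integer indices, and k = 10 is picked purely for illustration (the question does not state k).

```r
n <- 1e5
k <- 10                                  # hypothetical neighbour count, illustration only

dense_gb <- n * n * 8 / 1e9              # full N x N matrix of doubles
dist_gb  <- n * (n - 1) / 2 * 8 / 1e9    # lower triangle, as stored by dist()
knn_gb   <- n * k * (8 + 4) / 1e9        # per point: k distances (double) + k indices (integer)

print(c(dense = dense_gb, dist = dist_gb, knn = knn_gb))
```

Under these assumptions the dense forms need tens of gigabytes, while keeping only k neighbours per point needs on the order of megabytes.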

The packages for kNN that I've found so far (knnflex, kknn, etc.) all appear to use dense matrices. Also, the Matrix package does not offer a pairwise distance function.

Closer to my goal, I see that the spam package has a nearest.dist() function that allows one to only consider distances less than some threshold, delta. In my case, however, a particular value of delta may produce too many distances (so that I have to store the NxN matrix densely) or too few distances (so that I can't use kNN).

I have seen previous discussion on trying to perform k-means clustering using the bigmemory/biganalytics packages, but it doesn't seem like I can leverage these methods in this case.

Does anybody know a function/implementation that will compute a distance matrix in a sparse fashion in R? My (dreaded) backup plan is to have two for loops and save results in a Matrix object.

Recommended answer

Well, we can't have you resorting to for-loops, now can we :)

There is of course the question of how to represent the sparse matrix. A simple way is to have it only contain the indices of the points that are closest (and recalculate the distances as needed). But in the solution below, I put both the distances ('d1' etc.) and the indices ('i1' etc.) in a single matrix:

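sparseDist <- function(m, k) {
    m <- t(m)    # one point per column, so colSums() can be used below
    n <- ncol(m)
    d <- vapply(seq_len(n - 1L), function(i) {
        # squared distances from point i to the later points i+1..n only
        d <- colSums((m[, seq(i + 1L, n), drop = FALSE] - m[, i])^2)
        # positions of the k smallest; NA-padded when fewer than k points remain
        o <- sort.list(d, na.last = NA, method = 'quick')[seq_len(k)]
        # k distances, then the k indices shifted back to the original numbering
        c(sqrt(d[o]), o + i)
        }, numeric(2 * k)
    )
    dimnames(d) <- list(c(paste('d', seq_len(k), sep = ''),
        paste('i', seq_len(k), sep = '')), colnames(m)[-n])
    d
}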

Trying it on 9 2-d points:

> m <- matrix(c(0,0, 1.1,0, 2,0, 0,1.2, 1.1,1.2, 2,1.2, 0,2, 1.1,2, 2,2),
              9, byrow=TRUE, dimnames=list(letters[1:9], letters[24:25]))
> print(dist(m), digits=2)
    a   b   c   d   e   f   g   h
b 1.1                            
c 2.0 0.9                        
d 1.2 1.6 2.3                    
e 1.6 1.2 1.5 1.1                
f 2.3 1.5 1.2 2.0 0.9            
g 2.0 2.3 2.8 0.8 1.4 2.2        
h 2.3 2.0 2.2 1.4 0.8 1.2 1.1    
i 2.8 2.2 2.0 2.2 1.2 0.8 2.0 0.9
> print(sparseDist(m, 3), digits=2)
     a   b   c   d   e   f   g   h
d1 1.1 0.9 1.2 0.8 0.8 0.8 1.1 0.9
d2 1.2 1.2 1.5 1.1 0.9 1.2 2.0  NA
d3 1.6 1.5 2.0 1.4 1.2 2.2  NA  NA
i1 2.0 3.0 6.0 7.0 8.0 9.0 8.0 9.0
i2 4.0 5.0 5.0 5.0 6.0 8.0 9.0  NA
i3 5.0 6.0 9.0 8.0 9.0 7.0  NA  NA
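Since the question mentions storing results in a Matrix object, here is a minimal sketch of how the compact layout above could be turned into a sparse matrix. The helper name toSparse is hypothetical and not part of the original answer. It assumes the Matrix package is installed and that the input has the same layout as the sparseDist() output (k distance rows followed by k index rows, one column per point). The small input is written out by hand for three points on a line at 0, 1 and 3 with k = 2.

```r
library(Matrix)  # provides sparseMatrix()

# Hypothetical helper: rows 1..k of `res` are distances, rows k+1..2k are the
# matching neighbour indices, and column j describes point j.
toSparse <- function(res, n) {
    k <- nrow(res) / 2L
    dists <- as.vector(res[seq_len(k), , drop = FALSE])
    idx   <- as.vector(res[k + seq_len(k), , drop = FALSE])
    from  <- rep(seq_len(ncol(res)), each = k)   # source point of each entry
    keep  <- !is.na(idx)                         # drop the NA padding
    sparseMatrix(i = as.integer(from[keep]), j = as.integer(idx[keep]),
                 x = dists[keep], dims = c(n, n))
}

# Hand-written input in the same layout (points at 0, 1, 3; k = 2)
res <- rbind(d1 = c(1, 2), d2 = c(3, NA),
             i1 = c(2, 3), i2 = c(3, NA))
S <- toSparse(res, 3)
print(S)
```

Because each column only lists neighbours with a larger index, the entries all land in the upper triangle. Whether to mirror them into the lower triangle depends on how the kNN step consumes the matrix.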

And trying it on a larger problem (10k points). Still, on 100k points and more dimensions it will take a long time (something like 15-30 minutes).

n <- 1e4; p <- 3; m <- matrix(runif(n * p), n)   # 10k random points in 3 dimensions
system.time( d <- sparseDist(m, 3) ) # 9 seconds on my machine...
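The 15-minute end of that estimate follows from quadratic scaling. Here is a rough sketch of the arithmetic, assuming run time is proportional to the number of point pairs visited and ignoring the extra cost of more dimensions:

```r
t_10k <- 9                               # seconds reported above for n = 1e4
pairs <- function(n) n * (n - 1) / 2     # distinct point pairs visited
scale <- pairs(1e5) / pairs(1e4)         # roughly 100 times more pairs
est_minutes <- t_10k * scale / 60        # roughly 15 minutes
print(c(scale = scale, minutes = est_minutes))
```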

P.S. Just noticed that you posted an answer as I was writing this: the solution here is roughly twice as fast because it doesn't calculate the same distance twice (the distance between points 1 and 13 is the same as between points 13 and 1).
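One consequence of skipping the mirrored pairs is that each column only lists neighbours with a larger index, which is why the last columns of the small example contain NA. If a complete neighbour list per point is needed, one possible variant is sketched below. It is not from the original answer, the name fullKnn is hypothetical, and it gives up the factor-of-two saving by scanning all other points for every point.

```r
# Hypothetical variant: for every point, look at all other points and keep
# the k nearest. Returns a 2k x n matrix: k distances, then k indices.
fullKnn <- function(m, k) {
    m <- t(m)                            # one point per column
    n <- ncol(m)
    vapply(seq_len(n), function(i) {
        d <- colSums((m - m[, i])^2)     # squared distances to every point
        d[i] <- NA                       # exclude the point itself
        o <- sort.list(d, na.last = NA)[seq_len(k)]
        c(sqrt(unname(d[o])), o)
    }, numeric(2 * k))
}

# Three points on a line at 0, 1 and 3, one nearest neighbour each
pts <- matrix(c(0, 1, 3), ncol = 1)
nn <- fullKnn(pts, 1)
print(nn)
```

Here point 3 gets its own column with point 2 as its nearest neighbour, whereas the half-matrix layout records that pair only under point 2.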
