如何在R的稀疏矩阵中查找和命名连续的非零条目? [英] How to find and name contiguous non-zero entries in a sparse matrix in R?

查看:115
本文介绍了如何在R的稀疏矩阵中查找和命名连续的非零条目?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我的问题在概念上很简单. 我正在寻找一种计算有效的解决方案(我自己附上了我自己的解决方案).

My problem is conceptually simple. I am looking for a computationally efficient solution of it (my own one I attach at the end).

假设我们有一个潜在的非常大的稀疏矩阵,例如下面的左侧,并希望用单独的代码命名"连续的非零元素的每个区域(请参见右侧的矩阵)

Suppose we have a potentially very large sparse matrix like the one on the left below and want to 'name' every area of contiguous non-zero elements with a separate code (see matrix on the right)

1 1 1 . . . . .          1 1 1 . . . . .
1 1 1 . 1 1 . .          1 1 1 . 4 4 . .
1 1 1 . 1 1 . .          1 1 1 . 4 4 . .
. . . . 1 1 . .   --->   . . . . 4 4 . .
. . 1 1 . . 1 1          . . 3 3 . . 7 7
1 . 1 1 . . 1 1          2 . 3 3 . . 7 7
1 . . . 1 . . .          2 . . . 5 . . .
1 . . . . 1 1 1          2 . . . . 6 6 6

在我的应用程序中,连续的元素将形成矩形,直线或单点,并且它们只能与顶点相互接触(即矩阵中不会有不规则/非矩形的区域).

In my application the contiguous elements would form rectangles, lines or single points and they can only touch each other with the vertexes (i.e. there would be no irregular/non rectangular areas in the matrix).

我想象的解决方案是将稀疏矩阵表示的行索引和列索引匹配到具有适当值(名称"代码)的向量.我的解决方案使用多个for loops并适用于中小型矩阵,但随着矩阵的尺寸变大(> 1000),将很快陷入循环.这可能取决于我对R编程的了解并不高-我找不到任何计算技巧/函数可以更好地解决它.

The solution I imagined is to match the row and column indexes of the sparse matrix representation to a vector with the appropriate values (the 'name' codes). My solution uses several for loops and works fine for small to medium matrices but will fast get stuck in the loops as the dimensions of the matrix become large (>1000). It probably depends from the fact I'm not so advanced in R programming - I couldn't find any computational trick/function to solve it better.

有人可以建议在R中使用一种计算更有效的方法吗?

Can anybody suggest a computationally more efficient way to do that in R?

我的解决方案:

mySolution <- function(X){

  if (class(X) != "ngCMatrix") {stop("Input must be a Sparse Matrix")}
  ind <- which(X == TRUE, arr.ind = TRUE)
  r <- ind[,1]
  c <- ind[,2]

  lr <- nrow(ind)
  for (i in 1:lr) {
    if(i == 1) {bk <- 1}
    else {
      if (r[i]-r[i-1] == 1){bk <- c(bk, bk[i-1])}
      else {bk <- c(bk, bk[i-1]+1)}
    }
  }

  for (LOOP in 1:(lr-1)) {
    tr <- r[LOOP]
    tc <- c[LOOP]
    for (j in (LOOP+1):lr){
      if (r[j] == tr) {
        if(c[j] == tc + 1) {bk[j] <- bk[LOOP]} 
      }
    }
  }

  val <- unique(bk)
  for (k in 1:lr){
    bk[k] <- which(val==bk[k])
  }

  return(sparseMatrix(i = r, j = c, x = bk))
}

在此先感谢您的帮助或指点.

Thanks in advance for any help or pointer.

推荐答案

在很大程度上依赖于所有要分组的相邻元素仅形成矩形/直线/点的事实,我们看到矩阵的元素可以基于它们的元素进行汇总[row, col]通过关系(abs(row1 - row2) + abs(col1 - col2)) < 2在矩阵上进行索引.

Relying heavily on the fact that all neighbouring elements to be grouped form only rectangles/lines/points, we see that elements of the matrix can be aggregated based on their [row, col] indices on the matrix by the relation (abs(row1 - row2) + abs(col1 - col2)) < 2.

因此,从[row, col]索引开始:

sm = as.matrix(summary(m))

我们计算它们的距离,正如GiuGe所指出的那样,这实际上是曼哈顿"方法:

We compute their distance, which -as noted by GiuGe- is actually the "manhattan" method:

d = dist(sm, "manhattan")

在其最近邻居的聚类元素中的单链接属性在这里很有用.另外,我们可以通过对

进行cutree分组(索引距离为"<2")来对元素进行分组:

Single-linkage's property in clustering elements on their nearest neighbour is useful here. Also, we can get a grouping of the elements by cutreeing on "h = 1" (where the distance of indices is "< 2"):

gr = cutree(hclust(d, "single"), h = 1)

最后,我们可以将以上内容包装在一个新的稀疏矩阵中:

Finally, we can wrap the above in a new sparse matrix:

sparseMatrix(i = sm[, "i"], j = sm[, "j"], x = gr)
#8 x 8 sparse Matrix of class "dgCMatrix"
#                    
#[1,] 1 1 1 . . . . .
#[2,] 1 1 1 . 4 4 . .
#[3,] 1 1 1 . 4 4 . .
#[4,] . . . . 4 4 . .
#[5,] . . 3 3 . . 7 7
#[6,] 2 . 3 3 . . 7 7
#[7,] 2 . . . 5 . . .
#[8,] 2 . . . . 6 6 6

使用的"m"是:

library(Matrix)
m = new("ngCMatrix"
    , i = c(0L, 1L, 2L, 5L, 6L, 7L, 0L, 1L, 2L, 0L, 1L, 2L, 4L, 5L, 4L, 
5L, 1L, 2L, 3L, 6L, 1L, 2L, 3L, 7L, 4L, 5L, 7L, 4L, 5L, 7L)
    , p = c(0L, 6L, 9L, 14L, 16L, 20L, 24L, 27L, 30L)
    , Dim = c(8L, 8L)
    , Dimnames = list(NULL, NULL)
    , factors = list()
)

编辑 2017年2月10日

另一个想法(再次考虑到相邻元素仅形成矩形/直线/点这一事实)是通过[row, col]索引(在递增的列中)进行迭代,并在每个步骤中找到每个元素的距离当前列和行中其最近邻居的数量.如果找到"<2"的距离,则将该元素与其邻居分组,否则开始新的分组.包装成一个函数:

Another idea (and, again, considering the fact that neighbouring elements form only rectangles/lines/points) is to iterate -in ascending columns- through the [row, col] indices and, in each step, find the distance of each element of its nearest neighbour in current column and row. If a "< 2" distance is found then the element is grouped with its neighbour, else a new group is started. Wrapped into a function:

ff = function(x) 
{
    sm = as.matrix(summary(x))

    gr = integer(nrow(sm)); ngr = 0L ; gr[1] = ngr 

    lastSeenRow = integer(nrow(x))
    lastSeenCol = integer(ncol(x))

    for(k in 1:nrow(sm)) {
        kr = sm[k, 1]; kc = sm[k, 2]
        i = lastSeenRow[kr]
        j = lastSeenCol[kc]

        if(i && (abs(kc - sm[i, 2]) == 1)) gr[k] = gr[i]
        else if(j && (abs(kr - sm[j, 1]) == 1)) gr[k] = gr[j]  
             else { ngr = ngr + 1L; gr[k] = ngr } 

        lastSeenRow[kr] = k
        lastSeenCol[kc] = k        
    }

    sparseMatrix(i = sm[, "i"], j = sm[, "j"], x = gr)                 
}                  

并应用于"m":

ff(m)
#8 x 8 sparse Matrix of class "dgCMatrix"
#                    
#[1,] 1 1 1 . . . . .
#[2,] 1 1 1 . 4 4 . .
#[3,] 1 1 1 . 4 4 . .
#[4,] . . . . 4 4 . .
#[5,] . . 3 3 . . 7 7
#[6,] 2 . 3 3 . . 7 7
#[7,] 2 . . . 5 . . .
#[8,] 2 . . . . 6 6 6

另外,两个函数以相同的顺序返回组也很方便,我们可以检查一下:

Also, it's convenient that both functions return groups in the same order, as we can check:

identical(mySolution(m), ff(m))
#[1] TRUE

在一个看似更复杂的示例中:

On a, seemingly, more complex example:

mm = new("ngCMatrix"
    , i = c(25L, 26L, 27L, 25L, 29L, 25L, 25L, 17L, 18L, 26L, 3L, 4L, 5L, 
14L, 17L, 18L, 25L, 27L, 3L, 4L, 5L, 17L, 18L, 23L, 26L, 3L, 
4L, 5L, 10L, 17L, 18L, 9L, 11L, 17L, 18L, 10L, 17L, 18L, 3L, 
17L, 18L, 21L, 17L, 18L, 17L, 18L, 1L, 2L, 3L, 4L, 16L, 8L, 17L, 
18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 7L, 9L, 10L, 11L, 26L, 
8L, 27L, 1L, 2L, 28L, 1L, 2L, 15L, 27L, 1L, 2L, 21L, 22L, 1L, 
2L, 7L, 21L, 22L, 1L, 2L, 6L, 24L, 1L, 2L, 5L, 11L, 16L, 25L, 
26L, 27L, 4L, 15L, 17L, 19L, 25L, 26L, 27L, 3L, 16L, 25L, 26L, 
27L, 2L, 28L, 1L)
    , p = c(0L, 0L, 3L, 3L, 5L, 6L, 7L, 7L, 10L, 18L, 25L, 31L, 35L, 38L, 
42L, 44L, 46L, 51L, 61L, 66L, 68L, 71L, 75L, 79L, 84L, 88L, 96L, 
103L, 108L, 110L, 111L)
    , Dim = c(30L, 30L)
    , Dimnames = list(NULL, NULL)
    , factors = list()
)
identical(mySolution(mm), ff(mm))
#[1] TRUE

在更大的矩阵上有一个简单的基准:

And a simple benchmark on a larger matrix:

times = 30 # times `dim(mm)`
MM2 = do.call(cbind, rep_len(list(do.call(rbind, rep_len(list(mm), times))), times))
dim(MM2)
#[1] 900 900

system.time({ ans1 = mySolution(MM2) })
#   user  system elapsed 
# 449.50    0.53  463.26

system.time({ ans2 = ff(MM2) })
#   user  system elapsed 
#   0.51    0.00    0.52

identical(ans1, ans2)
#[1] TRUE

这篇关于如何在R的稀疏矩阵中查找和命名连续的非零条目?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆