How to get the largest possible column sequence with the least possible row NAs from a huge matrix?

Views: 121 | Posted: 2020/7/3 21:06:28 | Tags: r cluster-analysis powerset traminer sequence-analysis

Question

I want to select columns from a data frame so that the resulting contiguous column sequences are as long as possible while the number of rows containing NAs is as small as possible, because those rows have to be dropped afterwards.

(The reason is that I want to run TraMineR::seqsubm() to automatically get a matrix of transition costs (based on transition probabilities) and later run cluster::agnes() on it. TraMineR::seqsubm() doesn't like NA states, and cluster::agnes() with NA states in the matrix doesn't necessarily make much sense.)

For that purpose I already wrote a working function that, in principle, enumerates all possible subsets (the power set) of rows and of columns and checks them for NAs. It works well with this toy data d, which represents an 11x5 matrix (plus the id column):

> d
   id X1 X2 X3 X4 X5
1   A  1 11 21 31 41
2   B  2 12 22 32 42
3   C  3 13 23 33 NA
4   D  4 14 24 34 NA
5   E  5 15 25 NA NA
6   F  6 16 26 NA NA
7   G  7 17 NA NA NA
8   H  8 18 NA NA NA
9   I  9 NA NA NA NA
10  J 10 NA NA NA NA
11  K NA NA NA NA NA

The problem now is that I actually want to apply the algorithm to survey data that would represent a 34235 x 17 matrix!

My code has been reviewed on Code Review, but it still cannot be applied to the real data.

I am aware that this approach requires a huge amount of computation (presumably too much for anything but a supercomputer?!).
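
To put rough numbers on that, here is a back-of-the-envelope sketch for the 34235 x 17 case. The variable names are only for illustration. It compares the number of column subsets a power-set search has to visit with the number of contiguous column windows, and estimates the order of magnitude of the row power set.

```r
n_rows <- 34235
n_cols <- 17

# Non-empty column subsets visited by a power-set search
n_col_subsets <- 2^n_cols - 1                # 131071

# Contiguous column windows only: n * (n + 1) / 2
n_col_windows <- n_cols * (n_cols + 1) / 2   # 153

# 2^n_rows overflows a double, so look at its order of magnitude instead
row_subsets_log10 <- n_rows * log10(2)

cat("column subsets:     ", n_col_subsets, "\n")
cat("contiguous windows: ", n_col_windows, "\n")
cat("row subsets: about 10^", floor(row_subsets_log10), "\n", sep = "")
```

The column side alone is manageable either way. It is the power set over the rows that makes the search infeasible.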

Does anyone know a more suitable approach?

Here is the function, already improved by @minem on Code Review:

seqRank2 <- function(d, id = "id") {
  require(matrixStats)

  # change structure, convert to matrix
  ii <- as.character(d[, id])
  dm <- d
  dm[[id]] <- NULL
  dm <- as.matrix(dm)
  rownames(dm) <- ii

  your.powerset = function(s){
    l = vector(mode = "list", length = 2^length(s))
    l[[1]] = numeric()
    counter = 1L
    for (x in 1L:length(s)) {
      for (subset in 1L:counter) {
        counter = counter + 1L
        l[[counter]] = c(l[[subset]], s[x])
      }
    }
    return(l[-1])
  }

  psr <- your.powerset(ii)
  psc <- your.powerset(colnames(dm))

  sss <- lapply(psr, function(x) {
    i <- ii %in% x
    lapply(psc, function(y) dm[i, y, drop = FALSE])
  })

  cn <- sapply(sss, function(x)
    lapply(x, function(y) {

      if (ncol(y) == 1) {
        if (any(is.na(y))) return(NULL)
        return(y)
      }

      isna2 <- matrixStats::colAnyNAs(y)
      if (all(isna2)) return(NULL)
      if (sum(isna2) == 0) return(NA)
      r <- y[, !isna2, drop = FALSE]
      return(r)
    }))

  scr <- sapply(cn, nrow)
  scc <- sapply(cn, ncol)

  namesCN <- sapply(cn, function(x) paste0(colnames(x), collapse = ", "))
  names(scr) <- namesCN
  scr <- unlist(scr)

  names(scc) <- namesCN
  scc <- unlist(scc)

  m <- t(rbind(n.obs = scr, sq.len = scc))
  ag <- aggregate(m, by = list(sequence = rownames(m)), max)
  ag <- ag[order(-ag$sq.len, -ag$n.obs), ]
  rownames(ag) <- NULL
  return(ag)
}

Yielding:

> seqRank2(d)
         sequence n.obs sq.len
1  X1, X2, X3, X4     4      4
2      X1, X2, X3     6      3
3      X1, X2, X4     4      3
4      X1, X3, X4     4      3
5      X2, X3, X4     4      3
6          X1, X2     8      2
7          X1, X3     6      2
8          X2, X3     6      2
9          X1, X4     4      2
10         X2, X4     4      2
11         X3, X4     4      2
12             X1    10      1
13             X2     8      1
14             X3     6      1
15             X4     4      1
16             X5     2      1

> system.time(x <- seqRank2(d))
   user  system elapsed 
   1.93    0.14    2.93

In this case I would choose (X1, X2, X3, X4), (X1, X2, X3) or (X2, X3, X4), because they are contiguous and yield an appropriate number of observations.

Expected output:

So for the toy data d, the expected output would be something like:

> seqRank2(d)
         sequence n.obs sq.len
1  X1, X2, X3, X4     4      4
2      X1, X2, X3     6      3
3      X2, X3, X4     4      3
4          X1, X2     8      2
5          X2, X3     6      2
6          X3, X4     4      2
7              X1    10      1
8              X2     8      1
9              X3     6      1
10             X4     4      1
11             X5     2      1
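
A table of this kind can be built without any power set if only contiguous column windows matter: for p columns there are just p(p+1)/2 windows, and for each one it is enough to count the rows with no NA inside it. The following is a minimal sketch under that assumption. seq_rank_windows is a hypothetical helper name, not part of TraMineR or of the reviewed code, and the toy data is rebuilt inline so the block runs on its own.

```r
# Toy data from the question, rebuilt compactly
d <- data.frame(
  id = LETTERS[1:11],
  X1 = c(1:10, NA),
  X2 = c(11:18, rep(NA, 3)),
  X3 = c(21:26, rep(NA, 5)),
  X4 = c(31:34, rep(NA, 7)),
  X5 = c(41:42, rep(NA, 9))
)

# Hypothetical helper: rank contiguous column windows by length, then by
# the number of rows that have no NA inside the window
seq_rank_windows <- function(d, id = "id") {
  na <- is.na(as.matrix(d[, setdiff(names(d), id), drop = FALSE]))
  p <- ncol(na)
  # all (start, end) pairs with start <= end: p * (p + 1) / 2 windows
  w <- expand.grid(start = seq_len(p), end = seq_len(p))
  w <- w[w$start <= w$end, ]
  res <- data.frame(
    sequence = mapply(function(s, e) paste(colnames(na)[s:e], collapse = ", "),
                      w$start, w$end),
    n.obs = mapply(function(s, e) sum(rowSums(na[, s:e, drop = FALSE]) == 0),
                   w$start, w$end),
    sq.len = w$end - w$start + 1,
    stringsAsFactors = FALSE
  )
  res <- res[res$n.obs > 0, ]                    # drop windows with no complete row
  res <- res[order(-res$sq.len, -res$n.obs), ]   # longest first, then most rows
  rownames(res) <- NULL
  res
}

seq_rank_windows(d)
```

Unlike the table above, this sketch also lists the windows that end in X5 (for example X1 to X5, with 2 complete rows), because rows A and B are complete across all five columns. Whether those windows are wanted is a judgment call that the sketch leaves open.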

In the end, the function should run properly on the huge matrix d.huge, which currently leads to an error:

> seqRank2(d.huge)
Error in vector(mode = "list", length = 2^length(s)) : 
  vector size cannot be infinite
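
The error can be reproduced without the data. your.powerset() sizes its result list as 2^length(s), and for the 30,000 row ids of d.huge that value is larger than the largest finite double, so R evaluates it to Inf. A small sketch (the variable names are illustrative):

```r
n_ids <- 3e4                  # number of row ids in d.huge

size <- 2^n_ids               # list length requested inside your.powerset()
is.infinite(size)             # TRUE: 2^30000 overflows a double
.Machine$double.xmax          # largest finite double, roughly 1.8e308

# vector() refuses an infinite length; capture the message instead of stopping
msg <- tryCatch(vector(mode = "list", length = size),
                error = function(e) conditionMessage(e))
msg
```

So the failure happens before any subset is examined, and no amount of memory would change that for an input of this size.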

Toy data d:

d <- structure(list(id = structure(1:11, .Label = c("A", "B", "C", 
"D", "E", "F", "G", "H", "I", "J", "K"), class = "factor"), X1 = c(1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, NA), X2 = c(11L, 12L, 13L, 
14L, 15L, 16L, 17L, 18L, NA, NA, NA), X3 = c(21L, 22L, 23L, 24L, 
25L, 26L, NA, NA, NA, NA, NA), X4 = c(31L, 32L, 33L, 34L, NA, 
NA, NA, NA, NA, NA, NA), X5 = c(41L, 42L, NA, NA, NA, NA, NA, 
NA, NA, NA, NA)), row.names = c(NA, -11L), class = "data.frame")

Toy data d.huge:

d.huge <- setNames(data.frame(matrix(1:15.3e5, 3e4, 51)), 
                   c("id", paste0("X", 1:50)))
d.huge[, 41:51] <- lapply(d.huge[, 41:51], function(x){
  x[which(x %in% sample(x, .05*length(x)))] <- NA
  x
})

Appendix (see the comments on the latest answer):

d.huge <- read.csv("d.huge.csv")
d.huge.1 <- d.huge[sample(nrow(d.huge), 3/4*nrow(d.huge)), ]
d1 <- seqRank3(d.huge.1, 1.27e-1, 1.780e1)
d2 <- d1[complete.cases(d1), ]
dim(d2)
names(d2)
