如何从一个巨大的矩阵中获得尽可能少的行NA,并获得最大的可能的列序列? [英] How to get the largest possible column sequence with the least possible row NAs from a huge matrix?

查看:121
本文介绍了如何从一个巨大的矩阵中获得尽可能少的行NA,并获得最大的可能的列序列?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想从数据框中选择列,以使所得的连续列序列尽可能长,而具有NA的行数则尽可能少,因为它们必须然后放下.

I want to select columns from a data frame so that the resulting continuous column-sequences are as long as possible, while the number of rows with NAs is as small as possible, because they have to be dropped afterwards.

(我想这样做的原因是,我想运行TraMineR::seqsubm()以自动获取过渡成本矩阵(按过渡概率),然后在其上运行cluster::agnes().TraMineR::seqsubm()不喜欢矩阵中的NA状态和具有NA状态的cluster::agnes()不一定有意义.)

(The reason I want to do this is, that I want to run TraMineR::seqsubm() to automatically get a matrix of transition costs (by transition probability) and later run cluster::agnes() on it. TraMineR::seqsubm() doesn't like NA states and cluster::agnes() with NA states in the matrix doesn't necessarily make much sense.)

为此,我已经编写了一个有效的函数,该函数原则上会计算所有可能的功率子集并对其进行检查对于NA s.此玩具数据d代表10x5矩阵,效果很好:

For that purpose I already wrote a working function that computes by principle all possible power-subsets and checks them for NAs. It works well with this toy data d which represents a 10x5 matrix:

> d
   id X1 X2 X3 X4 X5
1   A  1 11 21 31 41
2   B  2 12 22 32 42
3   C  3 13 23 33 NA
4   D  4 14 24 34 NA
5   E  5 15 25 NA NA
6   F  6 16 26 NA NA
7   G  7 17 NA NA NA
8   H  8 18 NA NA NA
9   I  9 NA NA NA NA
10  J 10 NA NA NA NA
11  K NA NA NA NA NA

现在的问题是,我实际上想将该算法应用于表示 34235 x 17矩阵的数据!

The problem now is that I actually want to apply the algorithm to survey data that would represent a 34235 x 17 matrix!

我的代码已在代码审查"中进行了审查,但仍然无法应用于真实数据.

My code has been reviewed on Code Review, but still cannot be applied to the real data.

我知道,采用这种方法将需要大量的计算. (对于非超级计算机来说可能太大了吗?!)

I am aware that with this approach there would be a huge calculation. (Presumably too huge for non-supercomputers?!)

有人知道更合适的方法吗?

Does anyone know a more suitable approach?

我向您展示了由@minem提供的增强功能:

I show you the already enhanced function by @minem from Code Review:

seqRank2 <- function(d, id = "id") {
  require(matrixStats)

  # change structure, convert to matrix
  ii <- as.character(d[, id])
  dm <- d
  dm[[id]] <- NULL
  dm <- as.matrix(dm)
  rownames(dm) <- ii

  your.powerset = function(s){
    l = vector(mode = "list", length = 2^length(s))
    l[[1]] = numeric()
    counter = 1L
    for (x in 1L:length(s)) {
      for (subset in 1L:counter) {
        counter = counter + 1L
        l[[counter]] = c(l[[subset]], s[x])
      }
    }
    return(l[-1])
  }

  psr <- your.powerset(ii)
  psc <- your.powerset(colnames(dm))

  sss <- lapply(psr, function(x) {
    i <- ii %in% x
    lapply(psc, function(y) dm[i, y, drop =  F])
    })

  cn <- sapply(sss, function(x)
    lapply(x, function(y) {

      if (ncol(y) == 1) {
        if (any(is.na(y))) return(NULL)
          return(y)
        }

      isna2 <- matrixStats::colAnyNAs(y)
      if (all(isna2)) return(NULL)
      if (sum(isna2) == 0) return(NA)
      r <- y[, !isna2, drop = F]
      return(r)
      }))

  scr <- sapply(cn, nrow)
  scc <- sapply(cn, ncol)

  namesCN <- sapply(cn, function(x) paste0(colnames(x), collapse = ", "))
  names(scr) <- namesCN
  scr <- unlist(scr)

  names(scc) <- namesCN
  scc <- unlist(scc)

  m <- t(rbind(n.obs = scr, sq.len = scc))
  ag <- aggregate(m, by = list(sequence = rownames(m)), max)
  ag <- ag[order(-ag$sq.len, -ag$n.obs), ]
  rownames(ag) <- NULL
  return(ag)
}

屈服:

> seqRank2(d)
         sequence n.obs sq.len
1  X1, X2, X3, X4     4      4
2      X1, X2, X3     6      3
3      X1, X2, X4     4      3
4      X1, X3, X4     4      3
5      X2, X3, X4     4      3
6          X1, X2     8      2
7          X1, X3     6      2
8          X2, X3     6      2
9          X1, X4     4      2
10         X2, X4     4      2
11         X3, X4     4      2
12             X1    10      1
13             X2     8      1
14             X3     6      1
15             X4     4      1
16             X5     2      1

> system.time(x <- seqRank2(d))
   user  system elapsed 
   1.93    0.14    2.93 

在这种情况下,我会选择X1, X2, X3, X4X1, X2, X3X2, X3, X4,因为它们是连续的并会产生适当数量的观测值.

In this case I would choose X1, X2, X3, X4, X1, X2, X3 or X2, X3, X4 because they're continuous and yield an appropriate number of observations.

预期输出:

因此对于玩具数据d,预期输出将类似于:

So for toy data d the expected output would be something like:

> seqRank2(d)
sequence n.obs sq.len
1  X1, X2, X3, X4     4      4
2      X1, X2, X3     6      3
3      X2, X3, X4     4      3
4          X1, X2     8      2
5          X2, X3     6      2
6          X3, X4     4      2
7              X1    10      1
8              X2     8      1
9              X3     6      1
10             X4     4      1
11             X5     2      1

最后,函数应该在巨大的矩阵d.huge上正确运行,这会导致当前错误:

And at the end the function should run properly on the huge matrix d.huge which leads to errors at the moment:

> seqRank2(d.huge)
Error in vector(mode = "list", length = 2^length(s)) : 
  vector size cannot be infinite

玩具数据d:

d <- structure(list(id = structure(1:11, .Label = c("A", "B", "C", 
"D", "E", "F", "G", "H", "I", "J", "K"), class = "factor"), X1 = c(1L, 
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, NA), X2 = c(11L, 12L, 13L, 
14L, 15L, 16L, 17L, 18L, NA, NA, NA), X3 = c(21L, 22L, 23L, 24L, 
25L, 26L, NA, NA, NA, NA, NA), X4 = c(31L, 32L, 33L, 34L, NA, 
NA, NA, NA, NA, NA, NA), X5 = c(41L, 42L, NA, NA, NA, NA, NA, 
NA, NA, NA, NA)), row.names = c(NA, -11L), class = "data.frame")

玩具数据d.huge:

d.huge <- setNames(data.frame(matrix(1:15.3e5, 3e4, 51)), 
                   c("id", paste0("X", 1:50)))
d.huge[, 41:51] <- lapply(d.huge[, 41:51], function(x){
  x[which(x %in% sample(x, .05*length(x)))] <- NA
  x
})

附录(请参阅评论最新答案):

d.huge <- read.csv("d.huge.csv")
d.huge.1 <- d.huge[sample(nrow(d.huge), 3/4*nrow(d.huge)), ]
d1 <- seqRank3(d.huge.1, 1.27e-1, 1.780e1)
d2 <- d1[complete.cases(d1), ]
dim(d2)
names(d2)

推荐答案

这需要花费不到一秒钟的时间来处理庞大的数据

This takes less than one second on the huge data

l1 = combn(2:length(d), 2, function(x) d[x[1]:x[2]], simplify = FALSE)
# If you also need "combinations" of only single columns, then uncomment the next line
# l1 = c(d[-1], l1)
l2 = sapply(l1, function(x) sum(complete.cases(x)))

score = sapply(1:length(l1), function(i) NCOL(l1[[i]]) * l2[i])
best_score = which.max(score)
best = l1[[best_score]]

关于如何对各种组合进行排名的问题尚不清楚.我们可以使用不同的评分公式来生成不同的偏好.例如,要分别权衡行数与列数

The question was unclear about how to rank the various combinations. We can use different scoring formulae to generate different preferences. For example, to weight number of rows versus columns separately we can do

col_weight = 2
row_weight = 1
score = sapply(1:length(l1), function(i) col_weight*NCOL(l1[[i]]) +  row_weight * l2[i])

这篇关于如何从一个巨大的矩阵中获得尽可能少的行NA,并获得最大的可能的列序列?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆