How can I speed up this row-by-row operation in data.table


Problem description

I have a data.table with xe5 rows and approximately 100 columns. For each row, I want to find the indices of the first 3 columns whose value is neither NA nor 0.

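library(data.table)

# Build a 1e5 x 10 integer matrix with five non-NA values per row
m <- matrix(rep(NA_integer_, 1e6), ncol = 10)
for (i in 1:nrow(m)) {
    set.seed(i)  # per-row seed keeps the example reproducible
    m[i, sample(1:10, 5)] <- 1L:5L
}
DT <- data.table(m)
DT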
        V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
     1: NA  5  1  2  3 NA  4 NA NA  NA
     2: NA  1 NA NA  3  5  2 NA NA   4
     3: NA  1  4  3 NA NA NA  2  5  NA
     4:  2  4  3 NA  5  1 NA NA NA  NA
     5:  5  4  1 NA NA NA  2  3 NA  NA
    ---                               
 99996: NA NA  2  3  5  1 NA NA  4  NA
 99997:  2 NA NA NA  1 NA NA  3  5   4
 99998:  5 NA  4  2 NA  1  3 NA NA  NA
 99999: NA  5 NA  1 NA  4 NA  2 NA   3
100000:  5 NA NA NA  2  3  1 NA NA   4

# For one row, return the first three positions that are neither NA nor 0
f <- function(x) list(which(!is.na(x) & x != 0L)[1:3L])

# Baseline: apply() over the rows of the matrix
system.time(test <- apply(m, MARGIN = 1, FUN = f))
   user  system elapsed 
   1.30    0.00    1.29
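
To make the target output concrete, here is a minimal sketch of what `f` returns for single rows. The input vectors are made up for illustration (the first mirrors row 1 of the printed table); note that a row with fewer than three valid entries is padded with NA.

```r
# Same helper as in the question: first three positions that are neither NA nor 0
f <- function(x) list(which(!is.na(x) & x != 0L)[1:3L])

# Illustrative row with valid values in columns 2, 3, 4, 5 and 7
row_a <- c(NA, 5L, 1L, 2L, 3L, NA, 4L, NA, NA, NA)
res_a <- f(row_a)[[1]]   # positions 2, 3, 4

# Illustrative row with a single valid value: the result is padded with NA
row_b <- c(0L, NA, 7L)
res_b <- f(row_b)[[1]]   # position 3, then NA, NA
```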

I find this very slow. This might not be a task for data.table at all; I am looking for a fast way of getting this answer (any method is welcome).

Recommended answer

First, you can use the fact that 0/0 is NaN, for which is.na also returns TRUE. Dividing every value by 0 therefore reduces the two conditions to a single !is.na: NA stays NA, 0 becomes NaN, and every other value becomes Inf or -Inf. Second, you can vectorise using which with arr.ind = TRUE, which returns a row and a col index for every matching cell. We can then group by row and take the first three col values as follows:

system.time(tt <- data.table(which(!is.na(DT[, lapply(.SD, function(x) x/0)]), 
             arr.ind=TRUE), key="row")[, col[1:3], by="row"])
   user  system elapsed
  0.360   0.000   0.359
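
As a rough illustration of the two ideas in this answer, the sketch below applies them to a tiny made-up matrix using base R only. The names `m_small`, `idx` and `first_two` are hypothetical, and `split` stands in for the keyed data.table grouping used above.

```r
# NA and 0 both count as "missing": dividing by 0 maps them to NA and NaN,
# while any other value becomes Inf or -Inf
x <- c(NA, 0L, 3L, 7L)
valid <- !is.na(x / 0)

# Small 2 x 4 matrix, filled row by row
m_small <- matrix(c(NA, 0, 3, 7,
                     5, NA, 0, 2), nrow = 2, byrow = TRUE)

# One (row, col) pair per valid cell, returned in column-major order,
# so within each row the col values are already ascending
idx <- which(!is.na(m_small / 0), arr.ind = TRUE)

# Group the col indices by row and keep the first two per row
first_two <- lapply(split(unname(idx[, "col"]), idx[, "row"]),
                    function(cols) cols[1:2])
```

Here row 1 has valid values in columns 3 and 4, and row 2 in columns 1 and 4.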






Edit: an alternative way:

DT <- DT[, lapply(.SD, function(x) !is.na(x/0))]
out <- data.table(matrix(numeric(3e5), ncol=3))
system.time({    
for (i in as.integer(seq_along(DT))) {
    for (j in 1:3) {
        zeros <- .subset2(DT, i) & (out[[j]] == 0)
        out[zeros, names(out)[j] := i]
        DT[zeros, c(names(DT)[i]) := FALSE]
    }
}
})

I am not sure whether this is the fastest approach.
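
The column-wise loop above can be traced on a toy table. The sketch below is an illustration under assumptions rather than the answerer's exact code: the table `toy`, the mask `valid`, the loop variables `col_i` and `slot`, and the choice of two output slots instead of three are all made up for brevity. A value of 0 in `out` means "slot still empty".

```r
library(data.table)

# Toy table: 3 rows, 4 columns; NA and 0 both count as "missing"
toy <- data.table(a = c(NA, 1, 0), b = c(2, NA, 3),
                  c = c(0, 4, 5),  d = c(6, 7, NA))

# Logical mask of valid cells, using the x/0 trick
valid <- toy[, lapply(.SD, function(x) !is.na(x / 0))]

# One column per requested index; 0 marks a slot that is not filled yet
out <- data.table(matrix(numeric(nrow(toy) * 2), ncol = 2))

for (col_i in seq_along(valid)) {
  col_name <- names(valid)[col_i]
  for (slot in 1:2) {
    slot_name <- names(out)[slot]
    # Rows where this column is valid and the current slot is still empty
    fill <- valid[[col_i]] & (out[[slot]] == 0)
    out[fill, (slot_name) := as.numeric(col_i)]
    # Mark the cell as used so it cannot also fill the next slot
    valid[fill, (col_name) := FALSE]
  }
}
```

After the loop, `out$V1` holds the first valid column index of each row and `out$V2` the second. The loop runs over columns rather than rows, so the number of iterations depends on the column count, not the row count.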
