Efficiently counting non-NA elements in data.table

Question

Sometimes I need to count the number of non-NA elements in one column or another of my data.table. What is the best data.table-tailored way to do so?

For concreteness, let's work with this:

library(data.table)

# 1e6 rows, 100 possible ids, var is 1, 0 or NA; the table is keyed on id
DT <- data.table(id = sample(100, size = 1e6, replace = TRUE),
                 var = sample(c(1, 0, NA), size = 1e6, replace = TRUE), key = "id")

The first thing that comes to mind is this:

DT[!is.na(var), N := .N, by = id]

But this has the unfortunate shortcoming that N is not assigned on any row where var is missing, i.e. DT[is.na(var), N] is NA.
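A minimal sketch of that shortcoming, assuming a small made-up table (`toy` and its values are hypothetical and not part of the question's data):

```r
library(data.table)

# Hypothetical toy data: two ids, some missing values in var
toy <- data.table(id  = c(1, 1, 1, 2, 2),
                  var = c(1, NA, 0, NA, 1))

# Count non-NA values per id, but only on the rows where var is present
toy[!is.na(var), N := .N, by = id]

# Rows with a missing var were never touched, so their N is still NA
print(toy)
na_rows_N <- toy[is.na(var), N]
```

Because the subset in `i` restricts which rows `:=` writes to, the count never reaches the rows that were filtered out.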

So I work around this by appending a second step that spreads the count to every row of the group:

DT[!is.na(var), N := .N, by = id][ , N := max(N, na.rm = TRUE), by = id] # OPTION 1

However, I'm not sure this is the best approach. Two other options, one I thought of myself and one suggested by the analogous question for data.frames, would be:

DT[ , N := length(var[!is.na(var)]), by = id] # OPTION 2

DT[ , N := sum(!is.na(var)), by = id] # OPTION 3
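As a quick sanity check, the three options can be compared on a small made-up table. This is only a sketch: `toy`, `opt1`, `opt2` and `opt3` are hypothetical names, and every id is assumed to have at least one non-NA value.

```r
library(data.table)

# Hypothetical toy data; every id has at least one non-NA value
toy <- data.table(id  = c(1, 1, 1, 2, 2),
                  var = c(1, NA, 0, NA, 1))

# Work on copies, because := modifies a data.table by reference
opt1 <- copy(toy)[!is.na(var), N := .N, by = id][, N := max(N, na.rm = TRUE), by = id]
opt2 <- copy(toy)[, N := length(var[!is.na(var)]), by = id]
opt3 <- copy(toy)[, N := sum(!is.na(var)), by = id]

print(opt3)

# TRUE when all three options produce the same per-row counts
same_counts <- isTRUE(all.equal(opt1$N, opt2$N)) && isTRUE(all.equal(opt2$N, opt3$N))
```

Option 3 needs only one grouped pass and no intermediate subset of var, which is consistent with it being the cheapest of the three.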

Comparing the computation time of these (averaged over 100 trials), the last seems to be the fastest:

OPTION 1 | OPTION 2 | OPTION 3
  0.075  |  0.065   |  0.043

Does anyone know a faster way for data.table?

Recommended answer

Yes, option 3 seems to be the best one. I've added another option, which is valid only if you are willing to change the key of your data.table from id to var, but option 3 is still the fastest on your data.

library(microbenchmark)
library(data.table)

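# Benchmark data keyed on var instead of id, as the fourth option requires.
# Note: (1:100)[sample(10, ...)] yields only the ids 1..10.
dt <- data.table(id  = (1:100)[sample(10, size = 1e6, replace = TRUE)],
                 var = c(1, 0, NA)[sample(3, size = 1e6, replace = TRUE)],
                 key = "var")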

dt1 <- copy(dt)
dt2 <- copy(dt)
dt3 <- copy(dt)
dt4 <- copy(dt)

microbenchmark(times=10L,
               dt1[!is.na(var),.N,by=id][,max(N,na.rm=T),by=id],
               dt2[,length(var[!is.na(var)]),by=id],
               dt3[,sum(!is.na(var)),by=id],
               dt4[.(c(1,0)),.N,id,nomatch=0L])
# Unit: milliseconds
#                                                         expr      min       lq      mean    median        uq       max neval
#  dt1[!is.na(var), .N, by = id][, max(N, na.rm = T), by = id] 95.14981 95.79291 105.18515 100.16742 112.02088 131.87403    10
#                     dt2[, length(var[!is.na(var)]), by = id] 83.17203 85.91365  88.54663  86.93693  89.56223 100.57788    10
#                             dt3[, sum(!is.na(var)), by = id] 45.99405 47.81774  50.65637  49.60966  51.77160  61.92701    10
#                        dt4[.(c(1, 0)), .N, id, nomatch = 0L] 78.50544 80.95087  89.09415  89.47084  96.22914 100.55434    10
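
To make the join-based fourth option easier to follow, here is a small sketch on made-up data (`toy`, `by_join` and `by_sum` are hypothetical names). It assumes, as in the benchmark, that the table is keyed on var and that the non-missing values are known in advance to be 1 and 0.

```r
library(data.table)

# Hypothetical toy data, keyed on var as the join-based option requires
toy <- data.table(id  = c(1, 1, 1, 2, 2),
                  var = c(1, NA, 0, NA, 1),
                  key = "var")

# Join on the non-missing values of var, then count the matched rows per id
by_join <- toy[.(c(1, 0)), .N, by = id, nomatch = 0L]

# Option 3 written as a per-id summary, for comparison
by_sum <- toy[, .(N = sum(!is.na(var))), by = id]

# Put both results in the same order before comparing them
setorder(by_join, id)
setorder(by_sum, id)
print(by_join)
```

The join only works when the set of non-missing values can be listed explicitly, whereas `sum(!is.na(var))` makes no assumption about the values in var.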
