有效计数data.table中的非NA元素 [英] Efficiently counting non-NA elements in data.table
问题描述
有时候我需要计算 data.table
NA 个元素的数量>。 data.table
量身定制的最佳方法是什么?
Sometimes I need to count the number of non-NA
elements in one or another column in my data.table
. What is the best data.table
-tailored way to do so?
具体来说,让我们一起工作:
For concreteness, let's work with this:
DT <- data.table(id = sample(100, size = 1e6, replace = TRUE),
var = sample(c(1, 0, NA), size = 1e6, replace = TRUE), key = "id")
我想到的第一件事就是这样:
The first thing that comes to my mind works like this:
DT[!is.na(var), N := .N, by = id]
但是这有一个不幸的缺点是 N
不会分配给缺少 var
的任何行,即 DT [is。 na(var),N] = NA
。
But this has the unfortunate shortcoming that N
does not get assigned to any row where var
is missing, i.e. DT[is.na(var), N] = NA
.
所以我通过添加以下内容来解决此问题:
So I work around this by appending:
DT[!is.na(var), N:= .N, by = id][ , N := max(N, na.rm = TRUE), by = id] #OPTION 1
但是,我不确定这是最好的方法;我想到的另一种选择,也是由 data.frame
的这个问题将是:
However, I'm not sure this is the best approach; another option I thought of and one suggested by the analog to this question for data.frame
s would be:
DT[ , N := length(var[!is.na(var)]), by = id] # OPTION 2
和
DT[ , N := sum(!is.na(var)), by = id] # OPTION 3
比较这些计算时间(平均100多次试验) ,最后一个似乎是最快的:
Comparing computation time of these (average over 100 trials), the last seems to be the fastest:
OPTION 1 | OPTION 2 | OPTION 3
.075 | .065 | .043
有人知道 data.table
?
推荐答案
是的,选项3rd似乎是最好的选择。我添加了另一个仅当您考虑将data.table的键从 id
更改为 var $ c $时才有效c>,但选项3仍然是您数据上最快的。
Yes the option 3rd seems to be the best one. I've added another one which is valid only if you consider to change the key of your data.table from id
to var
, but still option 3 is the fastest on your data.
library(microbenchmark)
library(data.table)
dt<-data.table(id=(1:100)[sample(10,size=1e6,replace=T)],var=c(1,0,NA)[sample(3,size=1e6,replace=T)],key=c("var"))
dt1 <- copy(dt)
dt2 <- copy(dt)
dt3 <- copy(dt)
dt4 <- copy(dt)
microbenchmark(times=10L,
dt1[!is.na(var),.N,by=id][,max(N,na.rm=T),by=id],
dt2[,length(var[!is.na(var)]),by=id],
dt3[,sum(!is.na(var)),by=id],
dt4[.(c(1,0)),.N,id,nomatch=0L])
# Unit: milliseconds
# expr min lq mean median uq max neval
# dt1[!is.na(var), .N, by = id][, max(N, na.rm = T), by = id] 95.14981 95.79291 105.18515 100.16742 112.02088 131.87403 10
# dt2[, length(var[!is.na(var)]), by = id] 83.17203 85.91365 88.54663 86.93693 89.56223 100.57788 10
# dt3[, sum(!is.na(var)), by = id] 45.99405 47.81774 50.65637 49.60966 51.77160 61.92701 10
# dt4[.(c(1, 0)), .N, id, nomatch = 0L] 78.50544 80.95087 89.09415 89.47084 96.22914 100.55434 10
这篇关于有效计数data.table中的非NA元素的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!