Empty factors in "by" data.table
Problem description
I have a data.table that has a factor column with empty levels. I need to get the row counts and the sums of other variables, all grouped by multiple factors, including the one with empty levels. My question is similar to this one, but here I need to group by more than one factor.
For example, let the data.table be:
library('data.table')
dtr <- data.table(v1=sample(1:15),
v2=factor(sample(letters[1:3], 15, replace = TRUE),levels=letters[1:5]),
v3=sample(c("yes", "no"), 15, replace = TRUE))
I want to do the following:
dtr[,list(freq=.N,mm=sum(v1,na.rm=T)),by=list(v2,v3)]
#Output is:
v2 v3 freq mm
1: b yes 4 22
2: b no 1 13
3: c no 3 10
4: a no 4 49
5: c yes 1 10
6: a yes 2 16
I want the output to include the empty levels of v2 as well ("d" and "e"), as in table(dtr$v2,dtr$v3), so the final output should look like this (the order doesn't matter):
v2 v3 freq mm
1: b yes 4 22
2: b no 1 13
3: c no 3 10
4: a no 4 49
5: c yes 1 10
6: a yes 2 16
7: d yes 0 0
8: d no 0 0
9: e yes 0 0
10: e no 0 0
I tried the method from the linked question, but I'm not sure how to use the J() join function when grouping by multiple columns.
This works fine when grouping by one column only:
setkey(dtr,v2)
dtr[J(levels(v2)),list(freq=.N,mm=sum(v1,na.rm=T))]
However, dtr[J(levels(v2),v3),list(freq=.N,mm=sum(v1,na.rm=T))] doesn't include all combinations.
Recommended answer
library(data.table)
set.seed(42)
dtr <- data.table(v1=sample(1:15),
v2=factor(sample(letters[1:3], 15, replace = TRUE),levels=letters[1:5]),
v3=sample(c("yes", "no"), 15, replace = TRUE))
res <- dtr[,list(freq=.N,mm=sum(v1,na.rm=T)),by=list(v2,v3)]
You can use CJ (a cross join). Doing this after the aggregation avoids setting a key on the big table and should be faster.
setkey(res,v2,v3)
res[CJ(levels(dtr[,v2]),unique(dtr[,v3])),]
# v2 v3 freq mm
# 1: a no 1 9
# 2: a yes 2 11
# 3: b no 2 11
# 4: b yes 3 23
# 5: c no 4 40
# 6: c yes 3 26
# 7: d no NA NA
# 8: d yes NA NA
# 9: e no NA NA
# 10: e yes NA NA