Empty factors in "by" data.table

Problem description

I have a data.table with a factor column that has empty levels. I need the row count and the sums of other variables, all grouped by multiple factors, including the one with empty levels. My question is similar to this one, but here I need to account for multiple factors.

For example, let the data.table be:

library('data.table')

dtr <- data.table(v1=sample(1:15), 
v2=factor(sample(letters[1:3], 15, replace = TRUE),levels=letters[1:5]),
v3=sample(c("yes", "no"), 15, replace = TRUE))

I want to do the following:

dtr[,list(freq=.N,mm=sum(v1,na.rm=T)),by=list(v2,v3)]

#Output is:
   v2  v3 freq mm
1:  b yes    4 22
2:  b  no    1 13
3:  c  no    3 10
4:  a  no    4 49
5:  c yes    1 10
6:  a yes    2 16

I want the output to include the empty levels of v2 as well ("d" and "e"), as table(dtr$v2,dtr$v3) does, so the final output should look like this (the order doesn't matter):

    v2  v3 freq mm
 1:  b yes    4 22
 2:  b  no    1 13
 3:  c  no    3 10
 4:  a  no    4 49
 5:  c yes    1 10
 6:  a yes    2 16
 7:  d yes    0  0
 8:  d  no    0  0
 9:  e yes    0  0
10:  e  no    0  0
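For reference, the table() behaviour mentioned above can be seen on a small fixed table. This is only a sketch: the data below is made up so the result is reproducible (it is not the random sample from the question), and table() gives counts only, not the sums.

```r
library(data.table)

# Hypothetical fixed data standing in for the random sample in the question.
dtr <- data.table(
  v1 = 1:6,
  v2 = factor(c("a", "a", "b", "b", "c", "c"), levels = letters[1:5]),
  v3 = c("yes", "no", "yes", "yes", "no", "no")
)

# table() keeps unused factor levels, so "d" and "e" show up with a count of 0.
# as.data.table() turns the contingency table into long format with a column N.
counts <- as.data.table(table(v2 = dtr$v2, v3 = dtr$v3))
print(counts)
```

Every v2/v3 combination gets a row here, which is the shape wanted for the grouped result as well.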

I tried the method used in the link, but I'm not sure how to use the J() join function when multiple columns are involved.

This works fine when grouping by one column only:

setkey(dtr,v2)
dtr[J(levels(v2)),list(freq=.N,mm=sum(v1,na.rm=T))]

However, dtr[J(levels(v2),v3),list(freq=.N,mm=sum(v1,na.rm=T))] does not include all combinations.
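A likely reason is that J() builds its lookup table the way data.table() does, pairing its arguments element by element, while CJ() generates every combination. A minimal sketch of the difference, using two short made-up vectors (`lev` and `ans` are illustrative names only):

```r
library(data.table)

lev <- c("a", "b")      # stand-in for the factor levels
ans <- c("yes", "no")   # stand-in for the second grouping column

# Element-wise pairing: ("a","yes") and ("b","no") only.
paired <- data.table(v2 = lev, v3 = ans)

# Cross join: all four combinations of lev and ans.
crossed <- CJ(v2 = lev, v3 = ans)

print(paired)
print(crossed)
```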

Recommended answer

library(data.table)
set.seed(42)
dtr <- data.table(v1=sample(1:15), 
                  v2=factor(sample(letters[1:3], 15, replace = TRUE),levels=letters[1:5]),
                  v3=sample(c("yes", "no"), 15, replace = TRUE))

res <- dtr[,list(freq=.N,mm=sum(v1,na.rm=T)),by=list(v2,v3)]

You can use CJ (a cross join). Doing this after the aggregation avoids setting a key on the big table and should be faster. Combinations that do not occur in the data come back with NA in freq and mm.

setkeyv(res,c("v2","v3"))
res[CJ(levels(dtr[,v2]),unique(dtr[,v3])),]

#     v2  v3 freq mm
#  1:  a  no    1  9
#  2:  a yes    2 11
#  3:  b  no    2 11
#  4:  b yes    3 23
#  5:  c  no    4 40
#  6:  c yes    3 26
#  7:  d  no   NA NA
#  8:  d yes   NA NA
#  9:  e  no   NA NA
# 10:  e yes   NA NA
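The question asks for 0 rather than NA in the empty combinations. Here is a minimal sketch of one way to get there. It makes a few assumptions that are mine, not the answer's: a small fixed table replaces the random data, v2 is grouped as plain character so both join columns have the same type, and the join uses `on=` instead of a key.

```r
library(data.table)

# Hypothetical fixed data so the result is reproducible.
dtr <- data.table(
  v1 = 1:6,
  v2 = factor(c("a", "a", "b", "b", "c", "c"), levels = letters[1:5]),
  v3 = c("yes", "no", "yes", "yes", "no", "no")
)

# Aggregate first; only combinations present in the data get a row.
res <- dtr[, list(freq = .N, mm = sum(v1, na.rm = TRUE)),
           by = list(v2 = as.character(v2), v3)]

# Join onto every level/value combination; missing ones come back as NA.
full <- res[CJ(v2 = levels(dtr$v2), v3 = unique(dtr$v3)), on = c("v2", "v3")]

# Replace the NA placeholders with 0 in place.
full[is.na(freq), c("freq", "mm") := list(0L, 0L)]
print(full)
```

Filling after the join keeps the aggregation itself unchanged, so the same step can be appended to the answer's keyed version as well.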
