Aggregating if each observation can belong to multiple groups with multiple grouping variables


Problem description

This question is a follow-up to: Aggregating if each observation can belong to multiple groups.

As in the linked question, my observations can belong to several groups. But now I have two grouping variables, which makes the problem much harder (at least for me). In the example below, an observation can belong to one or more of the groups A, B, C. I also want to distinguish by a second factor, namely whether x < 1, x < .5 or x < 0. Since every x smaller than 0 is also smaller than 1, each observation can again belong to more than one group. I want to aggregate by both groupings (A, B, C and x < 1, x < .5, x < 0) and get an aggregate for every combination: (A and x < 1), (A and x < .5), ..., (C and x < 0).

Let me know if the question is not clear enough, and feel free to edit the title, since I could not come up with a proper one.

# The data
library(data.table)
n <- 500
set.seed(1)
TF <- c(TRUE, FALSE)
time <- rep(1:4, each = n/4)


df <- data.table(time = time, x = rnorm(n), groupA = sample(TF, size = n, replace = TRUE),
                 groupB = sample(TF, size = n, replace = TRUE),
                 groupC = sample(TF, size = n, replace = TRUE))

df[ ,c("smaller1", "smaller.5", "smaller0") := .(x <= 1, x <= 0.5, x <= 0)]

# The result should look like this (a solution for wide format would be nice as well) but less repetitive
rbind(
df[smaller1 == TRUE , .(lapply(.SD*x, sum), c("A_smaller1", "B_smaller1", "C_smaller1")), by=.(time),.SDcols = c("groupA", "groupB", "groupC")],
df[smaller.5 == TRUE , .(lapply(.SD*x, sum), c("A_smaller.5", "B_smaller.5", "C_smaller.5")), by=.(time),.SDcols = c("groupA", "groupB", "groupC")],
df[smaller0 == TRUE , .(lapply(.SD*x, sum), c("A_smaller0", "B_smaller0", "C_smaller0")), by=.(time),.SDcols = c("groupA", "groupB", "groupC")]
)


Recommended answer

First, melt the data and keep only the rows where the group flag is TRUE. Next, use CJ (i.e. cross join) to create a table of all combinations. Then perform a non-equi join with the melted data and sum x for each combination, as follows:

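# Long format: keep only the group columns, then only rows where the flag is TRUE
mDT <- melt(df, id.vars=c("time", "x"),
            measure.vars=c("groupA", "groupB", "groupC"))[(value)]
# For every (time, group, Level) combination, sum the x values below Level;
# in the result, the column named x holds the Level threshold
mDT[CJ(time=time, variable=variable, Level=seq(0,1,0.5), unique=TRUE),
    sum(x.x),
    by=.EACHI,
    on=.(time, variable, x < Level)]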
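
Since the question also asks for a wide-format result, here is a minimal sketch of how the long result could be reshaped with dcast and spot-checked against a direct computation. The names combos, res, long, wide and total are introduced here for illustration only and are not part of the original answer; the data is rebuilt from the question so the snippet runs on its own.

```r
library(data.table)

# Rebuild the example data from the question (same seed and sizes).
n <- 500
set.seed(1)
TF <- c(TRUE, FALSE)
df <- data.table(time = rep(1:4, each = n / 4),
                 x = rnorm(n),
                 groupA = sample(TF, size = n, replace = TRUE),
                 groupB = sample(TF, size = n, replace = TRUE),
                 groupC = sample(TF, size = n, replace = TRUE))

# Long format: one row per (observation, group it belongs to).
mDT <- melt(df, id.vars = c("time", "x"),
            measure.vars = c("groupA", "groupB", "groupC"))[(value)]

# Every (time, group, threshold) combination (hypothetical helper table).
combos <- CJ(time = mDT$time, variable = mDT$variable,
             Level = seq(0, 1, 0.5), unique = TRUE)

# Non-equi join: by = .EACHI returns one row per row of combos, in order.
res <- mDT[combos, .(total = sum(x.x)), by = .EACHI,
           on = .(time, variable, x < Level)]

# Attach the sums to the combinations so the threshold keeps its own name.
long <- copy(combos)[, total := res$total]

# Wide format: one row per (time, group), one column per threshold.
wide <- dcast(long, time + variable ~ Level, value.var = "total")

# Spot check one cell against a direct computation on the original data.
direct <- df[time == 1 & groupA & x < 1, sum(x)]
joined <- long[time == 1 & variable == "groupA" & Level == 1, total]
```

Here wide has columns named after the thresholds ("0", "0.5", "1"), and direct and joined should agree up to floating-point rounding. Note that this uses x < Level as in the answer, whereas the question's helper columns use <=; for continuous x the two coincide in practice.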
