Aggregate data.table based on condition in another row


Question

I want to aggregate a data.table based on two conditions, one of which is attached to another row. Here is my problem and a reproducible example:

I have a set of origin-destination pairs. For each origin, I want to sum the points in the destinations for which condition1 is satisfied. However, there are two tricky issues.


  1. The points in each origin-destination pair can only be summed once.
  2. The points should only be summed IF condition2 is satisfied in the reverse direction. That is, points in A-B can only be summed if condition1==T for A-B AND there is a B-A pair where condition2==T.



Reproducible example:

library(data.table)
dt <-  data.table( origin = c("A", "A", "A", "A", "A", "A", "B", "B", "A", "A", "C", "C", "B", "B", "B", "B", "B", "C", "C", "B", "A", "C", "C", "C", "C", "C", "A", "A", "C", "C", "B", "B"),
                   destination = c("A", "A", "A", "A", "B", "B", "A", "A", "C", "C", "A", "A", "B", "B", "B", "C", "C", "B", "B", "A", "B", "C", "C", "C", "A", "A", "C", "C", "B", "B", "C", "C"),
                   points_in_dest = c(5, 5, 5, 5, 4, 4, 5, 5, 3, 3, 5, 5, 4, 4, 4, 3, 3, 4, 4, 5, 4, 3, 3, 3, 5,5, 3, 3, 4, 4, 3, 3),
                   depart_time = c(7, 8, 16, 18, 7, 8, 16, 18, 7, 8, 16, 18, 7, 8, 16, 7, 8, 16, 18, 8, 16, 7, 8, 18, 7, 8, 16, 18, 7, 8, 16, 18),   
                   travel_time = c(0, 0, 0, 0, 70, 10, 70, 10, 10, 10, 70, 70, 0, 0, 0, 70, 10, 10, 70, 70, 10, 0, 0, 0, 10, 70, 10, 70, 10, 70, 70, 10) )

dt[ depart_time<=8  & travel_time < 60, condition1 := TRUE] # condition 1 - trips must be in the morning and shorter than 60 min
dt[ depart_time>=16 & travel_time < 60, condition2 := TRUE] # condition 2 - trips must be in the afternoon and shorter than 60 min

If I sum the points considering only condition1, this is what I get. Note that this query does not deal with the two issues: (1) it double counts points when more than one row of the same origin-destination pair satisfies condition1, and (2) it does not exclude the points when condition2 is not satisfied in the reverse direction.

dt[ condition1==T, .(points = sum(points_in_dest)), by=.(origin)]

>    origin points
> 1:      A     20
> 2:      B     11
> 3:      C     15



Desired output

>    origin points
> 1:      A      9
> 2:      B      7
> 3:      C     12

My real data frame has ~80 million rows, so I would appreciate an efficient solution, likely based on data.table. I realize this is a tricky problem and I would appreciate any help. Thanks in advance.

This is a common problem in the time-geography of accessibility with space-time constraints. The question is how many job opportunities you can choose from, given your space-time constraints and given that you live in block A, for example. There are 5 jobs in block A, 4 jobs in block B and 3 jobs in block C, and you are qualified to work in all of them. However, you can only take a job if you can get to the office in the morning (condition1) AND if you can be back at home after 4pm (condition2).

Recommended answer

As you only want to count every combination once, I would suggest doing an opposite join (destination to origin and origin to destination) on the unique subsets for each condition, and then simply summing the points by origin as you are already doing.

I came across a bug in data.table while solving this, hence the setattr(res, "sorted", NULL) line (which removes the key). This workaround won't affect performance. I've filed a bug report.

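setkey(dt, origin, destination) ## doing this so the `unique` function will work faster
## `by` is given explicitly: since data.table 1.9.8, `unique` defaults to all columns, not the key
res <- unique(dt[(condition1)], by = c("origin", "destination"))[
                                unique(dt[(condition2)], by = c("origin", "destination")),
                                on = c(destination = "origin", origin = "destination"),
                                nomatch = 0L]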
setattr(res, "sorted", NULL) ### Fixing the bug
res[, .(points = sum(points_in_dest)), keyby = origin]
#    origin points
# 1:      A      9
# 2:      B      7
# 3:      C     12
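
A variant of the same reverse-join idea can be sketched as follows. It names the pair columns explicitly when deduplicating and trims the afternoon table down to the two id columns before joining, which should keep the join small on large inputs. The names `morning`, `afternoon` and `access` are illustrative and not part of the original answer; this is a sketch checked only against the question's toy data, not a benchmarked solution.

```r
library(data.table)

# Same toy data as in the question's reproducible example.
dt <- data.table(
  origin = c("A", "A", "A", "A", "A", "A", "B", "B", "A", "A", "C", "C", "B", "B", "B", "B", "B", "C", "C", "B", "A", "C", "C", "C", "C", "C", "A", "A", "C", "C", "B", "B"),
  destination = c("A", "A", "A", "A", "B", "B", "A", "A", "C", "C", "A", "A", "B", "B", "B", "C", "C", "B", "B", "A", "B", "C", "C", "C", "A", "A", "C", "C", "B", "B", "C", "C"),
  points_in_dest = c(5, 5, 5, 5, 4, 4, 5, 5, 3, 3, 5, 5, 4, 4, 4, 3, 3, 4, 4, 5, 4, 3, 3, 3, 5, 5, 3, 3, 4, 4, 3, 3),
  depart_time = c(7, 8, 16, 18, 7, 8, 16, 18, 7, 8, 16, 18, 7, 8, 16, 7, 8, 16, 18, 8, 16, 7, 8, 18, 7, 8, 16, 18, 7, 8, 16, 18),
  travel_time = c(0, 0, 0, 0, 70, 10, 70, 10, 10, 10, 70, 70, 0, 0, 0, 70, 10, 10, 70, 70, 10, 0, 0, 0, 10, 70, 10, 70, 10, 70, 70, 10)
)

# Plain TRUE/FALSE flags (no NA), so they can be used directly as row filters.
dt[, condition1 := depart_time <= 8  & travel_time < 60]
dt[, condition2 := depart_time >= 16 & travel_time < 60]

# One row per origin-destination pair with at least one qualifying morning trip.
morning <- unique(dt[(condition1), .(origin, destination, points_in_dest)],
                  by = c("origin", "destination"))

# One row per pair with at least one qualifying afternoon trip (id columns only).
afternoon <- unique(dt[(condition2), .(origin, destination)])

# Keep a morning pair A-B only if the reverse pair B-A exists in `afternoon`.
res <- morning[afternoon,
               on = c(origin = "destination", destination = "origin"),
               nomatch = 0L]

# Sum the surviving destination points per origin.
access <- res[, .(points = sum(points_in_dest)), keyby = origin]
print(access)
```

Because `afternoon` holds each pair at most once, every morning pair can match at most one row, so no pair is counted twice.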
