Aggregate data.table based on condition in another row
Question
I want to aggregate a data.table based on two conditions, one of which is attached to another row. Here is my problem and a reproducible example:
I have a set of origin-destination pairs. For each origin, I want to sum the points in the destinations for which condition1 is satisfied. However, there are two tricky issues:
- The points in each origin-destination pair can only be summed once
- The points should only be summed IF condition2 is satisfied in the reverse flux. That is, points in A-B can only be summed if condition1 == T AND if there is a B-A pair where condition2 == T
Reproducible example:
library(data.table)
dt <- data.table( origin = c("A", "A", "A", "A", "A", "A", "B", "B", "A", "A", "C", "C", "B", "B", "B", "B", "B", "C", "C", "B", "A", "C", "C", "C", "C", "C", "A", "A", "C", "C", "B", "B"),
destination = c("A", "A", "A", "A", "B", "B", "A", "A", "C", "C", "A", "A", "B", "B", "B", "C", "C", "B", "B", "A", "B", "C", "C", "C", "A", "A", "C", "C", "B", "B", "C", "C"),
points_in_dest = c(5, 5, 5, 5, 4, 4, 5, 5, 3, 3, 5, 5, 4, 4, 4, 3, 3, 4, 4, 5, 4, 3, 3, 3, 5,5, 3, 3, 4, 4, 3, 3),
depart_time = c(7, 8, 16, 18, 7, 8, 16, 18, 7, 8, 16, 18, 7, 8, 16, 7, 8, 16, 18, 8, 16, 7, 8, 18, 7, 8, 16, 18, 7, 8, 16, 18),
travel_time = c(0, 0, 0, 0, 70, 10, 70, 10, 10, 10, 70, 70, 0, 0, 0, 70, 10, 10, 70, 70, 10, 0, 0, 0, 10, 70, 10, 70, 10, 70, 70, 10) )
dt[ depart_time<=8 & travel_time < 60, condition1 := T] # condition 1 - trips must be in the morning and shorter than 60 min
dt[ depart_time>=16 & travel_time < 60, condition2 := T] # condition 2 - trips must be in the afternoon and shorter than 60 min
If I sum the points considering only condition1, this is what I get. Note that this query does not deal with the two issues: (1) it double counts points when more than one row of the same origin-destination pair satisfies condition1, and (2) it does not exclude the points when condition2 is not satisfied.
dt[ condition1==T, .(poits = sum(points_in_dest)), by=.(origin)]
> origin poits
> 1: A 20
> 2: B 11
> 3: C 15
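Issue (1) can be looked at in isolation by reducing the table to one row per origin-destination pair before summing. The following is only a sketch under the assumption that points_in_dest is constant within a pair (as it is in the example data); the names pairs1 and once are made up for illustration, and condition2 is deliberately ignored here.

```r
library(data.table)

dt <- data.table(
  origin = c("A", "A", "A", "A", "A", "A", "B", "B", "A", "A", "C", "C", "B", "B", "B", "B", "B", "C", "C", "B", "A", "C", "C", "C", "C", "C", "A", "A", "C", "C", "B", "B"),
  destination = c("A", "A", "A", "A", "B", "B", "A", "A", "C", "C", "A", "A", "B", "B", "B", "C", "C", "B", "B", "A", "B", "C", "C", "C", "A", "A", "C", "C", "B", "B", "C", "C"),
  points_in_dest = c(5, 5, 5, 5, 4, 4, 5, 5, 3, 3, 5, 5, 4, 4, 4, 3, 3, 4, 4, 5, 4, 3, 3, 3, 5, 5, 3, 3, 4, 4, 3, 3),
  depart_time = c(7, 8, 16, 18, 7, 8, 16, 18, 7, 8, 16, 18, 7, 8, 16, 7, 8, 16, 18, 8, 16, 7, 8, 18, 7, 8, 16, 18, 7, 8, 16, 18),
  travel_time = c(0, 0, 0, 0, 70, 10, 70, 10, 10, 10, 70, 70, 0, 0, 0, 70, 10, 10, 70, 70, 10, 0, 0, 0, 10, 70, 10, 70, 10, 70, 70, 10)
)
dt[depart_time <= 8 & travel_time < 60, condition1 := TRUE]

# Keep one row per (origin, destination) pair that satisfies condition1.
# Rows where condition1 is NA are dropped by data.table's logical subsetting.
pairs1 <- unique(dt[(condition1), .(origin, destination, points_in_dest)])

# Sum each pair once per origin (condition2 is not applied yet).
once <- pairs1[, .(points = sum(points_in_dest)), keyby = origin]
print(once)  # A = 12, B = 7, C = 12
```

Compared with the totals above (20, 11, 15), the double counting is gone, but A still includes the A-C pair, which has no qualifying return trip; that is what issue (2) is about.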
Desired output
> origin poits
> 1: A 9
> 2: B 7
> 3: C 12
My real data frame is ~80 million rows, so I would appreciate an efficient solution, likely based on data.table. I realize this is a tricky problem and I would appreciate any help. Thanks in advance.
This is a common problem in the time-geography of accessibility with space-time constraints. The question is how many job opportunities you can choose from given your space-time constraints, assuming that you live in block A, for example. There are 5 jobs in block A, 4 jobs in block B and 3 jobs in block C, and you are qualified to work in all of them. However, you can only work in a job position if you can get to the office in the morning (condition1) AND if you can be back at home after 4pm (condition2).
Recommended answer
As you only want to count every combination once, I would suggest doing an opposite join (destination to origin and origin to destination) on the unique subsets for both conditions, and then simply summing the points by origin as you are already doing.
I came across a bug in data.table while solving this, hence the setattr(res, "sorted", NULL) line (which will remove the keys). This workaround won't affect performance. I've filed a bug report.
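setkey(dt, origin, destination) ## doing this so the `unique` function will work faster
## since data.table 1.9.8 `unique` compares all columns by default, so dedupe by pair explicitly
res <- unique(dt[(condition1)], by = c("origin", "destination"))[
         unique(dt[(condition2)], by = c("origin", "destination")),
         on = c(destination = "origin", origin = "destination"),
         nomatch = 0L]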
setattr(res, "sorted", NULL) ### Fixing the bug
res[, .(points = sum(points_in_dest)), keyby = origin]
# origin points
# 1: A 9
# 2: B 7
# 3: C 12
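The same reverse-join idea can also be written out step by step. This is only a sketch, not a tuned solution for 80 million rows: the names morning, afternoon, res2 and out are made up for illustration, and it assumes points_in_dest is constant within an origin-destination pair. Selecting just the needed columns before calling unique means the deduplication does not depend on the table's key.

```r
library(data.table)

dt <- data.table(
  origin = c("A", "A", "A", "A", "A", "A", "B", "B", "A", "A", "C", "C", "B", "B", "B", "B", "B", "C", "C", "B", "A", "C", "C", "C", "C", "C", "A", "A", "C", "C", "B", "B"),
  destination = c("A", "A", "A", "A", "B", "B", "A", "A", "C", "C", "A", "A", "B", "B", "B", "C", "C", "B", "B", "A", "B", "C", "C", "C", "A", "A", "C", "C", "B", "B", "C", "C"),
  points_in_dest = c(5, 5, 5, 5, 4, 4, 5, 5, 3, 3, 5, 5, 4, 4, 4, 3, 3, 4, 4, 5, 4, 3, 3, 3, 5, 5, 3, 3, 4, 4, 3, 3),
  depart_time = c(7, 8, 16, 18, 7, 8, 16, 18, 7, 8, 16, 18, 7, 8, 16, 7, 8, 16, 18, 8, 16, 7, 8, 18, 7, 8, 16, 18, 7, 8, 16, 18),
  travel_time = c(0, 0, 0, 0, 70, 10, 70, 10, 10, 10, 70, 70, 0, 0, 0, 70, 10, 10, 70, 70, 10, 0, 0, 0, 10, 70, 10, 70, 10, 70, 70, 10)
)
dt[depart_time <= 8  & travel_time < 60, condition1 := TRUE]
dt[depart_time >= 16 & travel_time < 60, condition2 := TRUE]

# One row per pair with a feasible morning trip (condition1).
morning <- unique(dt[(condition1), .(origin, destination, points_in_dest)])

# One row per pair with a feasible afternoon trip (condition2).
afternoon <- unique(dt[(condition2), .(origin, destination)])

# Keep a morning pair A-B only if an afternoon pair B-A exists:
# morning$origin is matched to afternoon$destination and vice versa.
res2 <- morning[afternoon,
                on = c(origin = "destination", destination = "origin"),
                nomatch = 0L]

out <- res2[, .(points = sum(points_in_dest)), keyby = origin]
print(out)  # A = 9, B = 7, C = 12
```

Because both helper tables hold at most one row per pair, the join cannot multiply rows, so each pair contributes its points exactly once.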