Mark start and end of groups
Problem description
Consider a data.table of the form
seller buyer month
1: 50536344 61961225 1993-01-01
2: 50536344 61961225 1993-02-01
3: 50536344 61961225 1993-04-01
4: 50536344 61961225 1993-05-01
5: 50536344 61961225 1993-06-01
where I have (buyer, seller) pairs over time. I want to mark the start and end of every pair. For example, there is a pair from January to February, none in March, and one from April to June. Hence, the following would be the expected output:
seller buyer month start end
1: 50536344 61961225 1993-01-01 True False
2: 50536344 61961225 1993-02-01 False True
3: 50536344 61961225 1993-04-01 True False
4: 50536344 61961225 1993-05-01 False False
5: 50536344 61961225 1993-06-01 False True
Recommended answer
Assuming that month is of class Date (or similarly POSIXt, IDateTime, or another class with a diff method), you can use the diff function to do this.
# sort data.table
setkeyv(dt, c("seller", "buyer", "month"))
# define start
dt[, start := c(TRUE, diff(month) > 31), by = list(seller, buyer)]
# define end
dt[, end := c(diff(month) > 31, TRUE), by = list(seller, buyer)]
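As a quick check, the two steps above can be run on the sample data from the question. This is only a sketch: the table name dt and the way the data is constructed here are assumptions, and month is created as a Date column so that diff returns gaps in days.

```r
library(data.table)

# Sample data from the question; building it this way is an assumption.
dt <- data.table(
  seller = 50536344,
  buyer  = 61961225,
  month  = as.Date(c("1993-01-01", "1993-02-01", "1993-04-01",
                     "1993-05-01", "1993-06-01"))
)

# Sort by pair and month so that diff() compares consecutive observations
setkeyv(dt, c("seller", "buyer", "month"))

# A row is a start if it is the first of its pair or follows a gap of more than 31 days
dt[, start := c(TRUE, diff(month) > 31), by = list(seller, buyer)]
# A row is an end if it is the last of its pair or precedes a gap of more than 31 days
dt[, end := c(diff(month) > 31, TRUE), by = list(seller, buyer)]

print(dt)
```

The start and end columns should match the expected output in the question: the February to April gap is 59 days, so February is marked as an end and April as a start.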
Edit: Per suggestion of @David Arenburg: you can of course define the start and end in one go. This should be slightly faster, although I also find it a bit more difficult to read.
dt[, `:=`(start = c(TRUE, diff(month) > 31),
          end   = c(diff(month) > 31, TRUE)),
   by = list(seller, buyer)]
EDIT2: Some more explanation of what is happening: the first observation for each pair of seller and buyer will always be the start of a business relationship, so start = c(TRUE, ...). After that, a further observation is the start of a business relationship if and only if the time since the previous observation is larger than a month (31 days), so diff(month) > 31. Putting the two together, you get c(TRUE, diff(month) > 31).
A similar logic applies for the end, where you have to compare to the next observation instead of the previous one.
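To make the explanation above concrete, here is a minimal base R sketch for a single pair, using the months from the question. The variable names gap_days, is_start and is_end are hypothetical and only serve the illustration.

```r
# Months observed for one (seller, buyer) pair, already sorted
month <- as.Date(c("1993-01-01", "1993-02-01", "1993-04-01",
                   "1993-05-01", "1993-06-01"))

# Days between each observation and the previous one (length n - 1)
gap_days <- as.numeric(diff(month))
print(gap_days)   # 31 59 30 31

# The first observation is always a start; later ones only after a gap > 31 days
is_start <- c(TRUE, gap_days > 31)
# The last observation is always an end; earlier ones only before a gap > 31 days
is_end <- c(gap_days > 31, TRUE)

print(data.frame(month, is_start, is_end))
```

Only the February to April gap (59 days) exceeds 31 days, which is why February is flagged as an end and April as a start. Within data.table, the by = list(seller, buyer) argument applies the same computation separately to each pair.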