Mark start and end of groups


Question

Consider a data.table structure of the form:

     seller    buyer      month  
1: 50536344 61961225 1993-01-01  
2: 50536344 61961225 1993-02-01 
3: 50536344 61961225 1993-04-01 
4: 50536344 61961225 1993-05-01 
5: 50536344 61961225 1993-06-01

where I have (buyer, seller) pairs over time. I want to mark the start and end of each pair's run of consecutive months. For example, there is a pair from January to February, none in March, and one from April to June. Hence, the following would be the expected output:

     seller    buyer      month  start    end
1: 50536344 61961225 1993-01-01   True  False
2: 50536344 61961225 1993-02-01  False   True
3: 50536344 61961225 1993-04-01   True  False
4: 50536344 61961225 1993-05-01  False  False
5: 50536344 61961225 1993-06-01  False   True


Recommended answer

Assuming that month is of class Date (or similarly POSIXt, IDateTime, or another class with a diff method), you can use the diff function to do this.

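# sort by pair and month so that diff() compares consecutive observations
setkeyv(dt, c("seller", "buyer", "month"))
# start: first row of each pair, or a gap of more than 31 days since the previous row
dt[, start := c(TRUE, diff(month) > 31), by = list(seller, buyer)]
# end: last row of each pair, or a gap of more than 31 days until the next row
dt[, end := c(diff(month) > 31, TRUE), by = list(seller, buyer)]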
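
To try the two steps above end to end, here is a minimal reproducible sketch. The sample table is reconstructed from the five rows shown in the question; storing seller and buyer as integers and month as Date is an assumption, since the question does not state the column types.

```r
library(data.table)

# Sample data rebuilt from the question (column types are an assumption)
dt <- data.table(
  seller = 50536344L,
  buyer  = 61961225L,
  month  = as.Date(c("1993-01-01", "1993-02-01", "1993-04-01",
                     "1993-05-01", "1993-06-01"))
)

# Sort by pair and month, then flag starts and ends within each pair
setkeyv(dt, c("seller", "buyer", "month"))
dt[, start := c(TRUE, diff(month) > 31), by = list(seller, buyer)]
dt[, end   := c(diff(month) > 31, TRUE), by = list(seller, buyer)]

print(dt)
```

Note that R prints the flags as TRUE/FALSE rather than the True/False shown in the question's expected output.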

EDIT: Per suggestion of @David Arenburg: you can of course define start and end in one go. This should be slightly faster, although I also find it a bit more difficult to read.

dt[, `:=`(start = c(TRUE, diff(month) > 31),
          end = c(diff(month) > 31, TRUE)),
   by = list(seller, buyer)]

EDIT2: Some more explanation of what is happening: the first observation for each pair of seller and buyer will always be the start of a business relationship, so start = c(TRUE, ...). After that, a further observation is the start of a business relationship if and only if the time since the previous observation is larger than a month (31 days), so diff(month) > 31. Putting the two together, you get c(TRUE, diff(month) > 31). A similar logic applies to the end, where you compare to the next observation instead of the previous one.
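
To make the intermediate values concrete, the following sketch walks through the same logic in base R for a single pair, using the five months from the question. The variable names gaps and new_spell are illustrative only and do not appear in the original answer.

```r
# Months observed for one (seller, buyer) pair, taken from the question
month <- as.Date(c("1993-01-01", "1993-02-01", "1993-04-01",
                   "1993-05-01", "1993-06-01"))

# Days between consecutive observations: 31 59 30 31
gaps <- as.numeric(diff(month))

# A gap of more than 31 days means at least one month was skipped
new_spell <- gaps > 31            # FALSE TRUE FALSE FALSE

# The first row always starts a spell; the last row always ends one
start <- c(TRUE, new_spell)       # TRUE FALSE TRUE FALSE FALSE
end   <- c(new_spell, TRUE)       # FALSE TRUE FALSE FALSE TRUE
```

The threshold works here because every observation falls on the first of a month: consecutive months are at most 31 days apart, while a skipped month gives a gap of at least 59 days.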
