提取时间重新加入 [英] Extracting time to re-join
问题描述
我的表格结构如下,其中包含有关俱乐部会员资格的信息
id开始结束已取消1 2020-01-01 2020-12-31 2021-01-101 2021-02-01 2022-01-31不适用2 2020-01-01 2020-12-31不适用3 2020-01-01 2020-06-30 2020-07-013 2020-07-10 2021-01-09 2021-01-313 2021-02-02 2021-08-01不适用
我对之后重新加入的成员感兴趣.对于上述数据,在整个期间, id 1
都执行一次, id 2
则不执行,而 id 3
则执行两次./p>
对于每次发生这种情况,他们的成员资格终止与重新加入之间的时间是多少?对于以上数据,应为:
id rejoin_date time_to_rejoin1 2021-01-31 22天3 2020-07-10 9天3 2021-02-02 2天
对于R:
structure(list(id = c(1,1,2,3,3,3),start = c("2020-01-01","2021-02-01","2020-01-01","2020-01-01","2020-07-10","2021-02-02"),end = c("2020-12-31","2022-01-31","2020-12-31","2020-06-30",已取消的"2021-01-09","2021-08-01")= c("2021-01-10",NA,NA,"2020-07-01","2021-01-31",NA)),类="data.frame",row.names = c(NA,-6L))
对于SQL:
在不存在`dt`的情况下创建表(`id` int(6)unsigned NOT NULL,开始"时间戳,`结束`时间戳,取消TIMESTAMP空)DEFAULT CHARSET = utf8;将值插入`dt`(`id`,`start`,`end`,`cancelled`)值("1","2020-01-01","2020-12-31","2021-01-10"),('2','2021-02-01','2022-01-31',NULL),('2','2021-01-01','2020-12-31',NULL),("3","2020-01-01","2020-06-30","2020-07-01"),("3","2021-02-10","2021-01-09","2021-01-31"),('3','2021-02-02','2021-08-01',NULL)
对于使用R(理想情况下使用dplyr)或SQL(MySQL)的答案,我会感到满意
您可以使用 lag()
然后进行过滤:
选择t.*,datediff(开始,prev_cancelled)为num_days_since_cancel从(选择t.*,滞后(取消)超过(按开始按ID顺序对分区进行划分)为prev_cancelled从T)吨其中prev_cancelled不为null;
此处是db小提琴.
>I have a table structured as follows, which contains information about club memberships
id start end cancelled
1 2020-01-01 2020-12-31 2021-01-10
1 2021-02-01 2022-01-31 NA
2 2020-01-01 2020-12-31 NA
3 2020-01-01 2020-06-30 2020-07-01
3 2020-07-10 2021-01-09 2021-01-31
3 2021-02-02 2021-08-01 NA
I am interested in members that re-joined afterwards. For the above data, for the entire period, id 1
did this once, id 2
did not, and id 3
did so twice.
Ffor each time that this occurred, what was the time between their membership ending, and them rejoining ? For the above data, this would be:
id rejoin_date time_to_rejoin
1 2021-01-31 22 days
3 2020-07-10 9 days
3 2021-02-02 2 days
For R:
structure(list(id = c(1, 1, 2, 3, 3, 3), start = c("2020-01-01",
"2021-02-01", "2020-01-01", "2020-01-01", "2020-07-10", "2021-02-02"
), end = c("2020-12-31", "2022-01-31", "2020-12-31", "2020-06-30",
"2021-01-09", "2021-08-01"), cancelled = c("2021-01-10", NA,
NA, "2020-07-01", "2021-01-31", NA)), class = "data.frame", row.names = c(NA,
-6L))
For SQL:
CREATE TABLE IF NOT EXISTS `dt` (
`id` int(6) unsigned NOT NULL,
`start` TIMESTAMP,
`end` TIMESTAMP,
`cancelled` TIMESTAMP NULL
) DEFAULT CHARSET=utf8;
INSERT INTO `dt` (`id`, `start`, `end`, `cancelled`) VALUES
('1', '2020-01-01', '2020-12-31', '2021-01-10'),
('2', '2021-02-01', '2022-01-31', NULL ),
('2', '2021-01-01', '2020-12-31', NULL ),
('3', '2020-01-01', '2020-06-30', '2020-07-01'),
('3', '2021-02-10', '2021-01-09', '2021-01-31'),
('3', '2021-02-02', '2021-08-01', NULL )
I would be happy with answers that use either R (ideally using dplyr) or SQL (MySQL)
You can use lag()
and then filter:
select t.*,
datediff(start, prev_cancelled) as num_days_since_cancel
from (select t.*,
lag(cancelled) over (partition by id order by start) as prev_cancelled
from t
) t
where prev_cancelled is not null;
Here is a db<>fiddle.
这篇关于提取时间重新加入的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!