根据日期合并长格式的两个数据帧 [英] Merging two data frames in long format based on date
问题描述
我有2个数据框,一个( df1
)记录了每天发生的不同活动,另一个( df2
)记录当天发生的活动的属性。
I have a 2 data frames, one (df1
) that records the daily occurrence of different activities and another (df2
) that records properties of the occurred activity during the day.
从 df1
可以识别重复活动以及持续时间。一天的开始时间由 Date
变量指定。
From df1
it is possible to identify the repeated occurrence of an activity as well the duration. When the day starts is specified by the Date
variable.
例如:
-
id
12发生从第1天开始,到d7结束。在这种情况下,发生次数为7,持续时间为11。 - 对于
id
123,该周从第5天开始,直到d7;由于在第6天有间隔天,持续时间为6,id 123(从第6天开始,到第7天结束)连续发生了两次,持续时间为6,所以重复出现。
id
12 the occurrence starts at day1 and ends at d7. In this case the occurrence is 7 and duration is 11.- for
id
123 the week starts at day 5 and ends at d7; occurred in repeated order because of there are gap days at day 6 and duration is 6 and id 123 (starts at day6 ends at day 7) occurred 2 times consecutively and duration 6.
在 df1
中,变量Date定义记录开始的日期。例如,id 12记录从第1天开始,依此类推。
In df1
the variable Date defines the day when the record started. For example id 12 record started at day1 and so on.
我想确定在连续出现期间是否在<$ c $中有关于活动属性的记录c> df2 。
I would like to identify if during the consecutive occurrence if there are records on the activity properties in df2
.
例如id 12,出现7次,持续时间为12,则有星期三的记录( df1
中的day3)并且该记录对应于连续发生的3天。对于id 123,没有数据(例如,没有连续出现),但对于id 6,它连续6天出现,持续时间18,则在第6天有记录。
For example id 12, occurred 7 times and duration is 12 there is record for Wednesday (day3 in df1
) and this record corresponds to the 3 day of the consecutive occurrence. For id 123 there is no data (eg. no consecutive occurrence) but for id 10 for 6 day occurrence and duration 18 there is a record on the 6th day.
Df1 :
id day1 day2 day3 day4 day5 day6 day7 Date
12 2 1 2 1 1 3 1 Mon
123 0 3 0 3 3 0 3 Fri
10 0 3 3 3 3 3 3 Sat
Df2:
id c1 c2 Date
12 3 3 Wednesday
123 3 2 Fri
10 3 1 Sat
结果:
id c1 c2 Occurrence Position
12 3 3 7 3
123 0 0 0 0
10 3 1 2 1
样本数据:df1
structure(list(id = c(12L, 123L, 10L), day1 = c(2L, 0L, 3L),
day2 = c(1L, 3L, 3L), day3 = c(2L, 0L, 3L), day4 = c(1L,
3L, 3L), day5 = c(1L, 3L, 3L), day6 = c(3L, 0L, 3L), day7 = c(1L,
3L, 3L), Date = c("Monday", "Friday", "Saturday")), row.names = c(NA,
-3L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x000002a81a571ef0>)
df2:
structure(list(id = c(12, 123, 10), c1 = c(3, 3, 3), c2 = c(3,
2, 1), Date = structure(c(3L, 1L, 2L), .Label = c("Friday", "Saturday",
"Wednesday"), class = "factor")), row.names = c(NA, -3L), class = "data.frame")
推荐答案
使用 dplyr
(也许不是最短的):
A solution with dplyr
(maybe not the shortest one):
# library
library(tidyverse)
# get data
df1 <- structure(list(id = c(12L, 123L, 10L),
day1 = c(2L, 0L, 3L),
day2 = c(1L, 3L, 3L),
day3 = c(2L, 0L, 3L),
day4 = c(1L,3L, 3L),
day5 = c(1L, 3L, 3L),
day6 = c(3L, 0L, 3L),
day7 = c(1L,3L, 3L),
Date = c("Monday", "Friday", "Saturday")),
row.names = c(NA,-3L), class = c("data.table", "data.frame"))
df2 <- structure(list(id = c(12, 123, 10),
c1 = c(3, 3, 3),
c2 = c(3, 2, 1),
Date = structure(c(3L, 1L, 2L), .Label = c("Friday", "Saturday","Wednesday"),
class = "factor")), row.names = c(NA, -3L), class = "data.frame")
# change days to nummeric (will help you later)
df1 %>% mutate(
Date_nr_df1=case_when(
Date=="Monday" ~ 1,
Date=="Tuesday" ~2,
Date=="Wednesday" ~3,
Date=="Thursday" ~4,
Date=="Friday" ~5,
Date=="Saturday" ~6,
Date=="Sunday" ~7)) -> df1
df2 %>% mutate(
Date_nr_df2=case_when(
Date=="Monday" ~ 1,
Date=="Tuesday" ~2,
Date=="Wednesday" ~3,
Date=="Thursday" ~4,
Date=="Friday" ~5,
Date=="Saturday" ~6,
Date=="Sunday" ~7)) -> df2
# combine data by the id column
left_join(df1,df2, by=c("id")) -> df
# adjust data
df %>%
group_by(id) %>% # to make changes per row
mutate(days=paste0(day1,day2,day3,day4,day5,day6,day7)) %>% #pastes the values together
mutate(days_correct=substring(days,Date_nr_df1)) %>% # applies the start day
mutate(Occurrence_seq=str_split(days_correct, fixed("0"))[[1]][1]) %>% # extracts all days before 0
mutate(Occurrence=nchar(Occurrence_seq)) %>% ## counts these days
mutate(Occurrence=case_when(Occurrence==1 ~ 0, TRUE ~ as.numeric(Occurrence))) %>% # sets Occurrence to 0 if there is no consecutive occurrence
mutate(Position=Date_nr_df2-Date_nr_df1+1) %>% ## calculates the position you wanted
mutate(c1=case_when(Occurrence==0 ~0, TRUE ~ c1),
c2=case_when(Occurrence==0 ~0, TRUE ~c1),
Position=case_when(Occurrence==0 ~ 0, TRUE ~ as.numeric(Position))) %>%
ungroup() %>% ungroups the df
select(id,c1,c2,Occurrence,Position) # selects the wanted variables
#> # A tibble: 3 x 5
#> id c1 c2 Occurrence Position
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 12 3 3 7 3
#> 2 123 0 0 0 0
#> 3 10 3 3 2 1
reprex包(v0.2.1)
Created on 2020-04-10 by the reprex package (v0.2.1)
这篇关于根据日期合并长格式的两个数据帧的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!