Using dplyr to count and mark based on type and rolling date in R
Problem description
My question is similar to dplyr: grouping and summarizing/mutating data with rolling time windows and I have used this for reference but have not been successful in manipulating it enough for what I need to do.
I have data that looks something like this:
library(data.table)
a <- data.table("TYPE" = c("A", "A", "B", "B",
"C", "C", "C", "C",
"D", "D", "D", "D"),
"DATE" = c("4/20/2018 11:47",
"4/25/2018 7:21",
"4/15/2018 6:11",
"4/19/2018 4:22",
"4/15/2018 17:46",
"4/16/2018 11:59",
"4/20/2018 7:50",
"4/26/2018 2:55",
"4/27/2018 11:46",
"4/27/2018 13:03",
"4/20/2018 7:31",
"4/22/2018 9:45"),
"CLASS" = c(1, 2, 3, 4,
1, 2, 3, 4,
1, 2, 3, 4))
From this I ordered the data first by TYPE and then by DATE, and created a column that contains just the date and ignores the time from the DATE column:
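# Parse DATE first: the raw strings sort lexicographically and are not in a format date() can parse
a[, DATE := as.POSIXct(DATE, format = "%m/%d/%Y %H:%M")]
a <- a[order(TYPE, DATE), ]
a[, YMD := lubridate::date(DATE)]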
Now I am trying to use the TYPE column and the YMD column to produce a new column. Here are the criteria I am trying to meet:
1) Maintain all columns from the original data set
2) Create a new column called, say, EVENTS
3) For each TYPE, if it occurs more than n times within 30 days, then put Y in the EVENTS column for each TYPE and YMD that made the group qualify, and N otherwise. (Note this is for n unique dates, so it must have n unique days within 30 days to qualify.)
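To make the "unique days" requirement concrete, here is a minimal base R sketch using the four TYPE D dates from the example data, where 4/27 appears twice. It assumes that "qualify" means at least n unique days in the window (that is how n = 4 is treated further down); the variable names are illustrative only.

```r
# Dates for TYPE "D" from the example data; 2018-04-27 appears twice.
d <- as.Date(c("2018-04-20", "2018-04-22", "2018-04-27", "2018-04-27"))

# Rows that fall in the 30-day window ending on the latest date
in_window <- d >= max(d) - 30 & d <= max(d)

n_rows <- sum(in_window)                 # counts rows, duplicates included
n_days <- length(unique(d[in_window]))   # counts unique calendar days

n <- 4
qualifies <- n_days >= n                 # assumption: "at least n unique days"
```

Counting rows would give 4 here, but there are only 3 unique days, so under this reading TYPE D does not qualify for n = 4.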
This would be the expected output if n = 4:
This is the closest example I have, but it does not account for unique days and it does not preserve all of the columns in the table:
a %>% mutate(DATE = as.POSIXct(DATE, format = "%m/%d/%Y %H:%M")) %>%
inner_join(.,., by="TYPE") %>%
group_by(TYPE, DATE.x) %>%
summarise(FLAG = as.integer(sum(abs((DATE.x-DATE.y)/(24*60*60))<=30)>=4))
Any suggestions are appreciated.
Update
Both of the answers below worked for my original example data. However, if we add a few more instances of D, they both mark all of D as 1 instead of marking the first 4 instances 0 and the last 4 instances 1. This is where the "rolling window" comes into play.
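The rolling behaviour described here can be checked by hand for the eight TYPE D dates in the updated data. The sketch below is only an illustration in base R, not the method used in the answers: it counts the unique days in the trailing 30-day window of each row, then flags every row that lies inside a window reaching n unique days. The names `cnt`, `hit` and `flag` are made up for this example.

```r
# The eight TYPE "D" dates from the updated data set
d <- as.Date(c("2018-04-20", "2018-04-22", "2018-04-27", "2018-04-27",
               "2018-06-01", "2018-06-03", "2018-06-07", "2018-06-10"))
w <- 30  # window length in days
n <- 4   # required number of unique days

# Unique days in the trailing window [d[i] - w, d[i]] for each row
cnt <- sapply(seq_along(d), function(i) {
  length(unique(d[d >= d[i] - w & d <= d[i]]))
})

# Rows whose trailing window reaches n unique days
hit <- which(cnt >= n)

# A row is flagged if it lies inside at least one qualifying window
flag <- sapply(seq_along(d), function(i) {
  as.integer(any(d[i] >= d[hit] - w & d[i] <= d[hit]))
})
```

With these dates only the window ending on 2018-06-10 reaches four unique days, so the April rows stay at 0 and the June rows become 1.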
Updated data set:
a <- data.table("TYPE" = c("A", "A", "B", "B",
"C", "C", "C", "C",
"D", "D", "D", "D",
"D", "D", "D", "D"),
"DATE" = c("4/20/2018 11:47",
"4/25/2018 7:21",
"4/15/2018 6:11",
"4/19/2018 4:22",
"4/15/2018 17:46",
"4/16/2018 11:59",
"4/20/2018 7:50",
"4/26/2018 2:55",
"4/27/2018 11:46",
"4/27/2018 13:03",
"4/20/2018 7:31",
"4/22/2018 9:45",
"6/01/2018 9:07",
"6/03/2018 12:34",
"6/07/2018 1:57",
"6/10/2018 2:22"),
"CLASS" = c(1, 2, 3, 4,
1, 2, 3, 4,
1, 2, 3, 4,
1, 2, 3, 4))
The new update expected output would be:
Here is a solution with dplyr:
Update based on OP edit
library(dplyr)
library(lubridate)
a <- data.frame("TYPE" = c("A", "A", "B", "B",
"C", "C", "C", "C",
"D", "D", "D", "D",
"D", "D", "D", "D"),
"DATE" = c("4/20/2018 11:47",
"4/25/2018 7:21",
"4/15/2018 6:11",
"4/19/2018 4:22",
"4/15/2018 17:46",
"4/16/2018 11:59",
"4/20/2018 7:50",
"4/26/2018 2:55",
"4/27/2018 11:46",
"4/27/2018 13:03",
"4/20/2018 7:31",
"4/22/2018 9:45",
"6/01/2018 9:07",
"6/03/2018 12:34",
"6/07/2018 1:57",
"6/10/2018 2:22"),
"CLASS" = c(1, 2, 3, 4,
1, 2, 3, 4,
1, 2, 3, 4,
1, 2, 3, 4))
# Keep the original column names so they can be selected at the end
nms <- names(a)
# Returns 1 if `type` has 4 or more distinct days in the window [date - w, date], else 0
count_window <- function(df, date, w, type){
  min_date <- date - w
  df2 <- df %>% filter(TYPE == type, YMD >= min_date, YMD <= date)
  out <- n_distinct(df2$YMD)
  res <- ifelse(out >= 4, 1, 0)
  return(res)
}
# Vectorize over date and type so the function can be applied to every row inside mutate()
v_count_window <- Vectorize(count_window, vectorize.args = c("date","type"))
res <- a %>% mutate(DATE = as.POSIXct(DATE, format = "%m/%d/%Y %H:%M")) %>%
  mutate(YMD = date(DATE)) %>%
  arrange(TYPE, YMD) %>%
  #group_by(TYPE) %>%
  mutate(min_date = YMD - 30,
         count = v_count_window(., YMD, 30, TYPE)) %>%
  group_by(TYPE) %>%
  # Flag every row on or after the start of the first qualifying window for its TYPE
  mutate(FLAG = case_when(
    any(count == 1) & YMD >= min_date[match(1,count)] ~ 1,
    TRUE ~ 0
  )) %>%
  select(all_of(nms), FLAG)
I couldn't figure out how to use the group in a custom function so I hard coded the filtering by type into the function.
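One possible way around the hard-coded TYPE filter, sketched here as an assumption rather than as part of the original answer: after group_by(), mutate() evaluates its expressions once per group, so a helper that only receives the date column sees just that group's dates. The helper name `rolling_distinct` and the small `df` are hypothetical.

```r
library(dplyr)

# Hypothetical helper: for each date in `d`, count the distinct dates
# that fall in the trailing window [d - w, d].
rolling_distinct <- function(d, w = 30) {
  vapply(seq_along(d), function(i) {
    length(unique(d[d >= d[i] - w & d <= d[i]]))
  }, integer(1))
}

# Small illustrative data set (dates taken from TYPE C and TYPE D above)
df <- data.frame(
  TYPE = c("C", "C", "C", "C", "D", "D", "D"),
  YMD = as.Date(c("2018-04-15", "2018-04-16", "2018-04-20", "2018-04-26",
                  "2018-04-20", "2018-04-22", "2018-04-27"))
)

res <- df %>%
  group_by(TYPE) %>%
  # YMD here is only the current group's dates, so no TYPE filter is needed
  mutate(count = rolling_distinct(YMD)) %>%
  ungroup()
```

Because the helper takes a plain vector, it can be tested on its own and does not need the whole data frame passed in through the `.` placeholder.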