Using dplyr to count and mark based on type and rolling date in R

Views: 90 · Published: 2020/10/26 4:15:45 · Tags: r, dplyr


Problem description

My question is similar to dplyr: grouping and summarizing/mutating data with rolling time windows and I have used this for reference but have not been successful in manipulating it enough for what I need to do.

I have data that looks something like this:

library(data.table)
a <- data.table("TYPE" = c("A", "A", "B", "B",
                       "C", "C", "C", "C",
                       "D", "D", "D", "D"), 
            "DATE" = c("4/20/2018 11:47",
                       "4/25/2018 7:21",
                       "4/15/2018 6:11",
                       "4/19/2018 4:22",
                       "4/15/2018 17:46",
                       "4/16/2018 11:59",
                       "4/20/2018 7:50",
                       "4/26/2018 2:55",
                       "4/27/2018 11:46",
                       "4/27/2018 13:03",
                       "4/20/2018 7:31",
                       "4/22/2018 9:45"),
            "CLASS" = c(1, 2, 3, 4,
                        1, 2, 3, 4,
                        1, 2, 3, 4))

From this I ordered the data first by TYPE and then by DATE and created a column that just contains the date and ignores the time from the DATE column:

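# Parse DATE first so that ordering is chronological rather than alphabetical
a[, DATE := as.POSIXct(DATE, format = "%m/%d/%Y %H:%M", tz = "UTC")]
a <- a[order(TYPE, DATE), ]
a[, YMD := as.Date(DATE)]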

Now I am trying to use the TYPE column and the YMD column to produce a new column. Here are the criteria I am trying to meet:
1) Maintain all columns from the original data set.
2) Create a new column called, say, EVENTS.
3) For each TYPE, if it occurs at least n times within 30 days, put Y in the EVENTS column for each TYPE and YMD that made the group qualify, and N otherwise. (Note that this counts unique dates, so a TYPE must have n unique days within 30 days to qualify.)

This would be the expected output if n = 4:

This is the closest example I have, but it does not account for unique days and it does not preserve all of the columns in the table:

a %>% mutate(DATE = as.POSIXct(DATE, format = "%m/%d/%Y %H:%M")) %>%
  inner_join(.,., by="TYPE") %>%
  group_by(TYPE, DATE.x) %>%
  summarise(FLAG = as.integer(sum(abs((DATE.x-DATE.y)/(24*60*60))<=30)>=4))
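
One detail of the attempt above may be worth checking: subtracting two POSIXct values returns a difftime whose units R picks automatically, so dividing by 24*60*60 only gives days when those units happen to be seconds. The following is a minimal sketch, using two timestamps from the example data and an assumed UTC time zone, that asks for days explicitly instead:

```r
# Two timestamps from the example data; the UTC time zone is an assumption.
t1 <- as.POSIXct("4/20/2018 11:47", format = "%m/%d/%Y %H:%M", tz = "UTC")
t2 <- as.POSIXct("4/25/2018 7:21",  format = "%m/%d/%Y %H:%M", tz = "UTC")

units(t2 - t1)         # "days": chosen automatically for this gap
units(c(t2, t1) - t1)  # "secs": a zero difference pulls the units down

# Asking for days explicitly avoids depending on the automatic choice.
gap_days <- as.numeric(difftime(t2, t1, units = "days"))
gap_days <= 30         # TRUE
```

In the self-join every row is also paired with itself, which is probably why the seconds-based division works there, but `difftime(..., units = "days")` states the intent directly.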

Any suggestions are appreciated.

Update

Both of the answers below worked for my original example data. However, if we add a few more instances of D, they both mark all of D as 1 instead of marking the first 4 instances 0 and the last 4 instances 1. This is where the "rolling window" comes into play.

Updated data set:

a <- data.table("TYPE" = c("A", "A", "B", "B",
                       "C", "C", "C", "C",
                       "D", "D", "D", "D",
                       "D", "D", "D", "D"), 
            "DATE" = c("4/20/2018 11:47",
                       "4/25/2018 7:21",
                       "4/15/2018 6:11",
                       "4/19/2018 4:22",
                       "4/15/2018 17:46",
                       "4/16/2018 11:59",
                       "4/20/2018 7:50",
                       "4/26/2018 2:55",
                       "4/27/2018 11:46",
                       "4/27/2018 13:03",
                       "4/20/2018 7:31",
                       "4/22/2018 9:45",
                       "6/01/2018 9:07",
                       "6/03/2018 12:34",
                       "6/07/2018 1:57",
                       "6/10/2018 2:22"),
            "CLASS" = c(1, 2, 3, 4,
                        1, 2, 3, 4,
                        1, 2, 3, 4,
                        1, 2, 3, 4))
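
To illustrate why only the later D rows should qualify, the rule can be worked through by hand for TYPE D. This is a small base R sketch, not part of the original attempt; `window_days()` is a hypothetical helper name, and the window is assumed to be the 30 days up to and including each date:

```r
# Dates for TYPE "D" from the updated data set, sorted and with times dropped.
d <- as.Date(c("2018-04-20", "2018-04-22", "2018-04-27", "2018-04-27",
               "2018-06-01", "2018-06-03", "2018-06-07", "2018-06-10"))

# window_days() is a hypothetical helper: for each date, count the distinct
# days that fall in the w days up to and including that date.
window_days <- function(dates, w = 30) {
  vapply(seq_along(dates), function(i) {
    in_window <- dates >= dates[i] - w & dates <= dates[i]
    length(unique(dates[in_window]))
  }, integer(1))
}

window_days(d)
# 1 2 3 3 1 2 3 4
```

The two 4/27 rows count as a single day, so the April rows never reach four distinct days. Only the window ending on 6/10 does, and it starts on 5/11, so it covers the four June rows and none of the April ones.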

The new update expected output would be:

Solution

Here is a solution with dplyr:

Update based on OP edit

library(dplyr)
library(lubridate)
a <- data.frame("TYPE" = c("A", "A", "B", "B",
                           "C", "C", "C", "C",
                           "D", "D", "D", "D",
                           "D", "D", "D", "D"), 
                "DATE" = c("4/20/2018 11:47",
                           "4/25/2018 7:21",
                           "4/15/2018 6:11",
                           "4/19/2018 4:22",
                           "4/15/2018 17:46",
                           "4/16/2018 11:59",
                           "4/20/2018 7:50",
                           "4/26/2018 2:55",
                           "4/27/2018 11:46",
                           "4/27/2018 13:03",
                           "4/20/2018 7:31",
                           "4/22/2018 9:45",
                           "6/01/2018 9:07",
                           "6/03/2018 12:34",
                           "6/07/2018 1:57",
                           "6/10/2018 2:22"),
                "CLASS" = c(1, 2, 3, 4,
                            1, 2, 3, 4,
                            1, 2, 3, 4,
                            1, 2, 3, 4))

# a function to flag rows that are 4th or more within window w
count_window <- function(df, date, w, type){
  min_date <- date - w
  df2 <- df %>% filter(TYPE == type, YMD >= min_date, YMD <= date)
  out <- n_distinct(df2$YMD)
  res <- ifelse(out >= 4, 1, 0)
  return(res)
}

v_count_window <- Vectorize(count_window, vectorize.args = c("date","type"))

res <- a %>% mutate(DATE = as.POSIXct(DATE, format = "%m/%d/%Y %H:%M")) %>%
  mutate(YMD = date(DATE)) %>% 
  arrange(TYPE, YMD) %>% 
  #group_by(TYPE) %>% 
  mutate(min_date = YMD - 30,
         count = v_count_window(., YMD, 30, TYPE)) %>% 
  group_by(TYPE) %>% 
  mutate(FLAG = case_when(
    any(count == 1) & YMD >= min_date[match(1,count)] ~ 1,
    TRUE ~ 0
  ))%>% 
  ungroup() %>%
  select(all_of(names(a)), FLAG)  # `nms` was never defined; keep the original columns plus FLAG

I couldn't figure out how to use the group in a custom function, so I hard-coded the filtering by type into the function.
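
One possible way to avoid hard-coding the type filter is to let `group_by()` hand each TYPE's dates to a plain vector function. The sketch below is an alternative under stated assumptions, not the answer's method: `flag_rolling()` is a hypothetical helper, the window is the 30 days up to and including each date, and a row is flagged when it lies inside at least one window holding `n` or more distinct days.

```r
library(dplyr)

# The updated example data, rebuilt compactly.
a <- data.frame(
  TYPE  = rep(c("A", "B", "C", "D"), times = c(2, 2, 4, 8)),
  DATE  = c("4/20/2018 11:47", "4/25/2018 7:21", "4/15/2018 6:11",
            "4/19/2018 4:22", "4/15/2018 17:46", "4/16/2018 11:59",
            "4/20/2018 7:50", "4/26/2018 2:55", "4/27/2018 11:46",
            "4/27/2018 13:03", "4/20/2018 7:31", "4/22/2018 9:45",
            "6/01/2018 9:07", "6/03/2018 12:34", "6/07/2018 1:57",
            "6/10/2018 2:22"),
  CLASS = rep(c(1, 2, 3, 4), times = 4),
  stringsAsFactors = FALSE
)

# Hypothetical helper: works on the dates of a single TYPE.
flag_rolling <- function(ymd, n = 4, w = 30) {
  # Distinct days in the window [date - w, date] for every row.
  n_days <- vapply(seq_along(ymd), function(i) {
    length(unique(ymd[ymd >= ymd[i] - w & ymd <= ymd[i]]))
  }, integer(1))
  ends <- ymd[n_days >= n]  # end dates of the windows that qualify
  # A row is flagged if it lies inside any qualifying window.
  vapply(seq_along(ymd), function(i) {
    as.integer(any(ymd[i] >= ends - w & ymd[i] <= ends))
  }, integer(1))
}

res <- a %>%
  mutate(YMD = as.Date(DATE, format = "%m/%d/%Y")) %>%  # trailing time is ignored
  arrange(TYPE, YMD) %>%
  group_by(TYPE) %>%
  mutate(FLAG = flag_rolling(YMD)) %>%
  ungroup()
```

Because the helper only ever sees one group's dates, it needs no `type` argument and no `Vectorize()` wrapper. On this data it flags all four C rows and the four June D rows, and leaves A, B, and the April D rows at 0.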
