Rolling Count of Events Over Time Series

Problem description

I'm trying to calculate a rolling count (sum of occurrences) by group over a time window.

I have a data frame with some sample data like this:

dates = as.Date(c("2011-10-09",
        "2011-10-15",
        "2011-10-16", 
        "2011-10-18", 
        "2011-10-21", 
        "2011-10-22", 
        "2011-10-24"))

group1=c("A",
         "C",
         "A", 
         "A", 
         "L", 
         "F", 
         "A")
group2=c("D",
         "A",
         "B", 
         "H", 
         "A", 
         "A", 
         "E")

df1 <- data.frame(dates, group1, group2)

I build a separate data frame for each unique 'group'. For example, this is how the data frame for group "A" would look ("A" is present in every row, whether in group1 or group2).

For "A" (and later for each other group), I want to count the number of event occurrences in a time range: the date of the event (i.e., the present row's date) plus the previous 4 days. I want to roll that forward, so for example row 1 would have a count of 1, row 2 would also have a count of 1 (no events in the past 4 days aside from the present date), row 3 would have 2, row 4 would have 3, etc.

For each row, I'd like to end up with a column that says how many events occurred on that row's date (as given in the dates column) or in the 4 days before it.

Solution

For this example, you can probably use sapply to analyze each row, counting the number of entries on that day or up to 4 days earlier, like so:

df1$lastFour <-
  sapply(df1$dates, function(x){
    sum(df1$dates <= x & df1$dates >= x - 4)
  })

This results in the following df1:

       dates group1 group2 lastFour
1 2011-10-09      A      D        1
2 2011-10-15      C      A        1
3 2011-10-16      A      B        2
4 2011-10-18      A      H        3
5 2011-10-21      L      A        2
6 2011-10-22      F      A        3
7 2011-10-24      A      E        3
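The `sapply` approach compares every date against every other date, which can get slow on long series. As a possible alternative (not part of the original answer), the same counts can be obtained from two `findInterval` lookups on the sorted dates. This is a minimal sketch that assumes whole-day `Date` values; the helper name `countLastDays` and its `window` argument are made up for illustration.

```r
# Sample dates from the question
dates <- as.Date(c("2011-10-09", "2011-10-15", "2011-10-16", "2011-10-18",
                   "2011-10-21", "2011-10-22", "2011-10-24"))

# Hypothetical helper: for each date, count how many dates fall in the
# closed window [date - window, date]. Assumes whole-day Date values.
countLastDays <- function(d, window = 4) {
  d <- as.numeric(d)        # Dates are stored as days since 1970-01-01
  sorted <- sort(d)
  # findInterval(x, sorted) is the number of sorted values <= x, so the
  # difference counts the values in (x - window - 1, x], which for whole
  # days is the same as [x - window, x]
  findInterval(d, sorted) - findInterval(d - window - 1, sorted)
}

lastFour <- countLastDays(dates)
print(lastFour)
# [1] 1 1 2 3 2 3 3
```

The result matches the `lastFour` column above. If the dates carried fractional days (for example, after converting from date-times), the `- 1` shortcut would no longer be valid.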

If, as your question implies, your data come from a larger set and you want to run the analysis for each group, you could follow the steps below. Conceptually, I think the question is: how many events involved this group in the last four days, asked only on days with an event from that group?

First, here are some larger sample data with groups labelled as the first 10 letters of the alphabet:

biggerData <-
  data.frame(
    dates = sample(seq(as.Date("2011-10-01")
                       , as.Date("2011-10-31")
                       , 1)
                   , 100, TRUE)
    , group1 = sample(LETTERS[1:10], 100, TRUE)
    , group2 = sample(LETTERS[1:10], 100, TRUE)
  )

Next, I extract all of the groups in the data (here, I know them, but for your real data, you may or may not have that list of groups already):

groupsInData <-
  sort(unique(c(as.character(biggerData$group1)
                , as.character(biggerData$group2))))

Then, I loop through that vector of group names and extract each of the events with that group as one of the two groups, adding the same column as above, and saving the separate data.frames in a list (and naming them to make it easier to access/track them).

sepGroupCounts <- lapply(groupsInData, function(thisGroup){
  dfTemp <- biggerData[biggerData$group1 == thisGroup | 
                         biggerData$group2 == thisGroup, ]

  dfTemp$lastFour <-
    sapply(dfTemp$dates, function(x){
      sum(dfTemp$dates <= x & dfTemp$dates >= x - 4)
    })
  return(dfTemp)

}) 

names(sepGroupCounts) <- groupsInData

This returns a data.frame just like the one above for each of the groups in your data.
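To see what the list looks like in practice, here is a small sketch that runs the same per-group loop on the seven-row sample from the question and then pulls out individual groups by name. The variable `eventsPerGroup` is only for illustration.

```r
# Sample data from the question
dates <- as.Date(c("2011-10-09", "2011-10-15", "2011-10-16", "2011-10-18",
                   "2011-10-21", "2011-10-22", "2011-10-24"))
group1 <- c("A", "C", "A", "A", "L", "F", "A")
group2 <- c("D", "A", "B", "H", "A", "A", "E")
df1 <- data.frame(dates, group1, group2, stringsAsFactors = FALSE)

groupsInData <- sort(unique(c(df1$group1, df1$group2)))

# Same per-group loop as above, applied to the small data set
sepGroupCounts <- lapply(groupsInData, function(thisGroup) {
  dfTemp <- df1[df1$group1 == thisGroup | df1$group2 == thisGroup, ]
  dfTemp$lastFour <- sapply(dfTemp$dates, function(x) {
    sum(dfTemp$dates <= x & dfTemp$dates >= x - 4)
  })
  dfTemp
})
names(sepGroupCounts) <- groupsInData

# Access one group's data.frame by name
print(sepGroupCounts[["A"]])

# Number of events each group took part in
eventsPerGroup <- sapply(sepGroupCounts, nrow)
print(eventsPerGroup)
```

Because "A" appears in every row of this sample, its data.frame has all seven events, while every other group has exactly one.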

And, I couldn't help myself, so here is a dplyr and tidyr solution as well. It is not much different from the list-based solution above, except that it returns everything in the same data.frame (which may or may not be a good thing, particularly as it will have two entries for each event this way).

First, for simplicity, I defined a function to do the date checking. This could easily be used above as well.

myDateCheckFunction <- function(x){
  sapply(x, function(thisX){
    sum(x <= thisX & x >= thisX - 4 )
  })
}
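As a quick check of how the function treats events that share a date, here is a small usage sketch. The dates are invented for illustration: every event on a repeated date counts all the others on that date as well as itself.

```r
myDateCheckFunction <- function(x){
  sapply(x, function(thisX){
    sum(x <= thisX & x >= thisX - 4)
  })
}

# Illustrative dates with three events on the same day
exampleDates <- as.Date(c("2011-10-01", "2011-10-02",
                          "2011-10-05", "2011-10-05", "2011-10-05",
                          "2011-10-08"))

exampleCounts <- myDateCheckFunction(exampleDates)
print(exampleCounts)
# [1] 1 2 5 5 5 4
```

Each of the three Oct-05 events sees the window from Oct-01 through Oct-05, which holds five events; the Oct-08 event sees only the three Oct-05 events plus itself.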

Next, I am constructing a set of logical tests that will determine whether or not each of the groups is present. These will be used to generate columns for each group, giving TRUE/FALSE for present/absent in each event.

# The pipe (%>%) and the verbs used below come from dplyr and tidyr
library(dplyr)
library(tidyr)

dotsConstruct <-
  paste0("group1 == '", groupsInData, "' | "
         , "group2 == '", groupsInData, "'") %>%
  setNames(groupsInData)
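It may help to look at what these strings contain. The sketch below builds them for two groups in base R (no pipe needed) and evaluates them against a tiny made-up data frame with `eval(parse(...))`, purely to show the TRUE/FALSE columns they describe. The data frame `tinyEvents` and the variables `presentA` and `presentB` are hypothetical.

```r
groupsInData <- c("A", "B")

# One logical test per group, stored as text and named after the group
dotsConstruct <- setNames(
  paste0("group1 == '", groupsInData, "' | ",
         "group2 == '", groupsInData, "'"),
  groupsInData
)
print(dotsConstruct)

# Tiny made-up data set, only for illustration
tinyEvents <- data.frame(group1 = c("A", "C", "B"),
                         group2 = c("D", "A", "E"),
                         stringsAsFactors = FALSE)

# Evaluate each test with the data frame's columns in scope
presentA <- eval(parse(text = dotsConstruct[["A"]]), envir = tinyEvents)
presentB <- eval(parse(text = dotsConstruct[["B"]]), envir = tinyEvents)
print(presentA)  # TRUE TRUE FALSE
print(presentB)  # FALSE FALSE TRUE
```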

Finally, putting it all together in one piped call. Instead of describing each step, I have commented it.

withLastFour <-
  # Start with data
  biggerData %>%
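  # Add a col for each group using Standard Evaluation
  # (note: `mutate_()` is deprecated as of dplyr 0.7.0)
  mutate_(.dots = dotsConstruct) %>%
  # convert to long form; one row per group per event
  # (note: `gather()` is superseded by `pivot_longer()` in tidyr >= 1.0.0)
  gather(GroupAnalyzed, Present, -dates, -group1, -group2) %>%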
  # Limit to only rows where the `GroupAnalyzed` is present
  filter(Present) %>%
  # Remove the `Present` column, as it is now all "TRUE"
  select(-Present) %>%
  # Group by the groups we are analyzing
  group_by(GroupAnalyzed) %>%
  # Add the column for count in the last four dates
  # `group_by` limits this to just counts within that group
  mutate(lastFour = myDateCheckFunction(dates)) %>%
  # Sort by group and date for prettier checking
  arrange(GroupAnalyzed, dates)

The result is similar to the above list output, except with everything in one data.frame, which may allow for easier analysis of some features. The top looks like this:

       dates group1 group2 GroupAnalyzed lastFour
      <date> <fctr> <fctr>         <chr>    <int>
1 2011-10-01      B      A             A        1
2 2011-10-02      J      A             A        2
3 2011-10-05      C      A             A        5
4 2011-10-05      C      A             A        5
5 2011-10-05      G      A             A        5
6 2011-10-08      E      A             A        5

Note that my random sample had multiple events on Oct-05, leading to the large counts here.
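For readers who would rather avoid the package dependencies, the same long-form idea can be sketched in base R. This is an illustrative alternative, not the original answer's method: the `event` id column and the `countLastFour` name are assumptions introduced here, and the sketch runs on the seven-row sample from the question.

```r
# Sample data from the question
dates <- as.Date(c("2011-10-09", "2011-10-15", "2011-10-16", "2011-10-18",
                   "2011-10-21", "2011-10-22", "2011-10-24"))
group1 <- c("A", "C", "A", "A", "L", "F", "A")
group2 <- c("D", "A", "B", "H", "A", "A", "E")
df1 <- data.frame(dates, group1, group2, stringsAsFactors = FALSE)
df1$event <- seq_len(nrow(df1))   # hypothetical id, one per event

# Same window count as myDateCheckFunction
countLastFour <- function(x) {
  sapply(x, function(thisX) sum(x <= thisX & x >= thisX - 4))
}

# Long form: one row per event per group slot
long <- rbind(transform(df1, GroupAnalyzed = group1),
              transform(df1, GroupAnalyzed = group2))
# Keep a single row when an event lists the same group in both slots
long <- long[!duplicated(long[c("event", "GroupAnalyzed")]), ]

# ave() applies the counting function within each GroupAnalyzed
long$lastFour <- ave(as.numeric(long$dates), long$GroupAnalyzed,
                     FUN = countLastFour)

# Sort by group and date for easier checking
long <- long[order(long$GroupAnalyzed, long$dates), ]
print(long)
```

Note that `ave()` returns the counts as doubles rather than integers, since it fills them into the numeric vector it was given.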
