按日期对数据框进行分组:解决缺少时间段的错误 [英] Grouping a data frame by dates: resolve missing time periods' bug

查看:94
本文介绍了按日期对数据框进行分组:解决缺少时间段的错误的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

几周前,我在StackOverflow上从一个慷慨的响应者那里收到的一些不错的代码中发现了一个很难解决的错误,即使不是我自己创建的,也可以在今天使用一些新的帮助.

I've identified, if not myself created, a difficult bug to resolve in some nice code received from a generous respondent here on StackOverflow a few weeks ago, and I could use some new assistance today.

样本数据(以下称为对象eh):

Sample data (called object eh below):

    ID        2013-03-20 2013-04-09 2013-04-11 2013-04-17 2013-04-25 2013-05-15 2013-05-24 2013-05-25 2013-05-26
    5167f          0          0          0          0          0          0          0          0          0
    1214m          0          0          0          0          0          0          0          0          0
    1844f          0          0          0          0          0          0          0          0          0
    2113m          0          0          0          0          0          0          0          0          0
    2254m          0          0          0          0          0          0          0          0          0
    2721f          0          0          0          0          0          0          0          0          0
    3121f          0          0          0          0          0          0          0          0          0
    3486f          0          0          0          0          0          0          0          0          0
    3540f          0          0          0          0          0          0          0          0          0
    4175m          0          0          0          0          0          0          0          0          0

我需要能够按其各自的列日期所属的时间段(例如,每1、2、3或4周)对0s1s进行分组.每当1在特定日期范围(Period)内至少下降一次时,就会在该Period中为该ID汇总一个1(否则为0).

I need to be able to group 0s and 1s by the time period in which their respective column date falls (e.g., every 1, 2, 3, or 4 weeks). Whenever a 1 falls at least once within a specific date range (Period), then a 1 is summarized for that ID in that Period (0, else).

我以1周的摘要例程为例.我的主要问题是,在从"2013-03-20""2015-12-31"的时间序列中,最终生成的最终输出缺少某些可能的1周Periods.

I'm starting with the 1-week summary routine as an example. My main problem is that the final output generated lacks some of the total possible 1-week Periods during the time series "2013-03-20" to "2015-12-31".

在此示例输出中的通知,其中行表示唯一的IDs,列表示唯一的Periods,如何缺少Periods 2、5、7和9:

Notice in this example output, wherein the rows are for unique IDs and columns are for unique Periods, how Periods 2, 5, 7, and 9 are missing:

    1   3   4   6   8   10  11  12  13  14
    0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0

这里是对原始数据帧进行分组的完整例程(请参见上面共享的示例数据):

Here is the full routine for grouping the original data frame (see sample data shared above):

    #Convert to data table from original data frame, eh
    dt <- as.data.table(eh)

    #One week summarized encounter histories
    dt_merge <- data_frame(
      # Create a column showing the beginning date
      Date1 = seq(from = ymd("2013-03-20"), to = ymd("2015-12-31"), by = "1 week")) %>%
      # Create  a column showing the end date of each period
      mutate(Date2 = lead(Date1)) %>%
      # Adjust Date1
      mutate(Date1 = if_else(Date1 == ymd("2013-03-20"), Date1, Date1 + 1)) %>%
      # Remove the last row
      drop_na(Date2) %>%
      # Create date list
      mutate(Dates = map2(Date1, Date2, function(x, y){ seq(x, y, by = "day") })) %>%
      unnest() %>%
      # Create Group ID
      mutate(RunID = group_indices_(., dots. = c("Date1", "Date2"))) %>%
      # Create Period ID
      mutate(Period = paste0(RunID)) %>%
      # Add a column showing Month
      mutate(Month = month(Dates)) %>%
      # Add a column showing Year
      mutate(Year = year(Dates)) %>%
      # Add a column showing season
      mutate(Season = case_when(
        Month %in% 3:5            ~ "Spring",
        Month %in% 6:8            ~ "Summer",
        Month %in% 9:11           ~ "Fall",
        Month %in% c(12, 1, 2)    ~ "Winter",
        TRUE                      ~ NA_character_
      )) %>%
      # Combine Season and Year
      mutate(SeasonYear = paste0(Season, Year)) %>%
      select(-Date1, -Date2, -RunID)
    dt2 <- dt %>%
      # Reshape the data frame
      gather(Date, Value, -ID) %>%
      # Convert Date to date class
      mutate(Date = ymd(Date)) %>%
      # Join dt_merge
      left_join(dt_merge, by = c("Date" = "Dates")) 
    one.week <- dt2 %>%
      group_by(ID, Period) %>%
      summarise(Value = max(Value)) %>%
      spread(Period, Value)

    #Finished product
    one.week <- as.data.frame(one.week)

    #Missing weeks 2, 5, 7, and 9...
    one.week

有人可以帮助我了解我哪里出错了吗?预先感谢!

Can someone help me understand where I've gone wrong? Thanks in advance!

-AD

推荐答案

之所以发生这种情况,是因为eh数据缺少这些星期.例如,如果您查看组成第2周的日期:

This is happening because those weeks are missing from the eh data. For example, if you look at the dates that make up week 2:

dt_merge %>%
  filter(Period == 2)
#> # A tibble: 7 x 6
#>        Dates Period Month  Year Season SeasonYear
#>       <date>  <chr> <dbl> <dbl>  <chr>      <chr>
#> 1 2013-03-28      2     3  2013 Spring Spring2013
#> 2 2013-03-29      2     3  2013 Spring Spring2013
#> 3 2013-03-30      2     3  2013 Spring Spring2013
#> 4 2013-03-31      2     3  2013 Spring Spring2013
#> 5 2013-04-01      2     4  2013 Spring Spring2013
#> 6 2013-04-02      2     4  2013 Spring Spring2013
#> 7 2013-04-03      2     4  2013 Spring Spring2013

您可以看到eh列中没有这些日期,该日期从2013-03-20跳至2013-04-09.因为在创建dt2时使用了left_join,所以仅保留eh中的日期(因此是星期).

You can see that none of these dates are in the columns of eh, which skip from 2013-03-20 to 2013-04-09. Because you use a left_join when creating dt2, only dates (and therefore weeks) in eh are retained.

可以使用 tidyr 包中的complete()创建ID和日期的缺失组合来更正此问题.

This can be corrected by using complete() from the tidyr package to create the missing combinations of ID and Date.

dt2 <- dt %>%
  # Reshape the data frame
  gather(Date, Value, -ID) %>%
  # Convert Date to date class
  mutate(Date = ymd(Date)) %>%
  # Create missing ID/Date combinations
  complete(ID, Date = dt_merge$Dates) %>%
  # Join dt_merge
  left_join(dt_merge, by = c("Date" = "Dates"))
one.week <- dt2 %>%
  mutate(Period = as.numeric(Period)) %>%
  group_by(ID, Period) %>%
  summarise(Value = max(Value, na.rm = TRUE)) %>%
  spread(Period, Value)
one.week
#> # A tibble: 10 x 146
#> # Groups:   ID [10]
#>       ID   `1`   `2`   `3`   `4`   `5`   `6`   `7`   `8`   `9`  `10`  `11`
#>  * <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 1214m     0  -Inf     0     0  -Inf     0  -Inf     0  -Inf     0  -Inf
#>  2 1844f     0  -Inf     0     0  -Inf     0  -Inf     0  -Inf     0  -Inf
#>  3 2113m     0  -Inf     0     0  -Inf     0  -Inf     0  -Inf     0  -Inf
#>  4 2254m     0  -Inf     0     0  -Inf     0  -Inf     0  -Inf     0  -Inf
#>  5 2721f     0  -Inf     0     0  -Inf     0  -Inf     0  -Inf     0  -Inf
#>  6 3121f     0  -Inf     0     0  -Inf     0  -Inf     0  -Inf     0  -Inf
#>  7 3486f     0  -Inf     0     0  -Inf     0  -Inf     0  -Inf     0  -Inf
#>  8 3540f     0  -Inf     0     0  -Inf     0  -Inf     0  -Inf     0  -Inf
#>  9 4175m     0  -Inf     0     0  -Inf     0  -Inf     0  -Inf     0  -Inf
#> 10 5167f     0  -Inf     0     0  -Inf     0  -Inf     0  -Inf     0  -Inf
#> # ... with 134 more variables: `12` <dbl>, `13` <dbl>, `14` <dbl>,
#> #   `15` <dbl>, `16` <dbl>, `17` <dbl>, `18` <dbl>, `19` <dbl>,
#> #   `20` <dbl>, `21` <dbl>, `22` <dbl>, `23` <dbl>, `24` <dbl>,
#> #   `25` <dbl>, `26` <dbl>, `27` <dbl>, `28` <dbl>, `29` <dbl>,
#> #   `30` <dbl>, `31` <dbl>, `32` <dbl>, `33` <dbl>, `34` <dbl>,
#> #   `35` <dbl>, `36` <dbl>, `37` <dbl>, `38` <dbl>, `39` <dbl>,
#> #   `40` <dbl>, `41` <dbl>, `42` <dbl>, `43` <dbl>, `44` <dbl>,
#> #   `45` <dbl>, `46` <dbl>, `47` <dbl>, `48` <dbl>, `49` <dbl>,
#> #   `50` <dbl>, `51` <dbl>, `52` <dbl>, `53` <dbl>, `54` <dbl>,
#> #   `55` <dbl>, `56` <dbl>, `57` <dbl>, `58` <dbl>, `59` <dbl>,
#> #   `60` <dbl>, `61` <dbl>, `62` <dbl>, `63` <dbl>, `64` <dbl>,
#> #   `65` <dbl>, `66` <dbl>, `67` <dbl>, `68` <dbl>, `69` <dbl>,
#> #   `70` <dbl>, `71` <dbl>, `72` <dbl>, `73` <dbl>, `74` <dbl>,
#> #   `75` <dbl>, `76` <dbl>, `77` <dbl>, `78` <dbl>, `79` <dbl>,
#> #   `80` <dbl>, `81` <dbl>, `82` <dbl>, `83` <dbl>, `84` <dbl>,
#> #   `85` <dbl>, `86` <dbl>, `87` <dbl>, `88` <dbl>, `89` <dbl>,
#> #   `90` <dbl>, `91` <dbl>, `92` <dbl>, `93` <dbl>, `94` <dbl>,
#> #   `95` <dbl>, `96` <dbl>, `97` <dbl>, `98` <dbl>, `99` <dbl>,
#> #   `100` <dbl>, `101` <dbl>, `102` <dbl>, `103` <dbl>, `104` <dbl>,
#> #   `105` <dbl>, `106` <dbl>, `107` <dbl>, `108` <dbl>, `109` <dbl>,
#> #   `110` <dbl>, `111` <dbl>, ...

如果在给定的星期没有该ID的值,则返回-Inf.另外,也可以用complete(ID, Date = dt_merge$Dates, fill = list(Value = 0))将缺失值填充为例如0,而不用NA填充缺失值.对于任何未观察到的ID和日期组合,这会将Value变量设置为0.

Here -Inf is returned if there were no values for that ID in a given week. Alternatively, instead of filling the missing values with NA, they could be filled with, for example 0, using complete(ID, Date = dt_merge$Dates, fill = list(Value = 0)). This will make the Value variable 0 for any of the unobserved ID and Date combinations.

这篇关于按日期对数据框进行分组:解决缺少时间段的错误的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆