How to summarise data based on calculations on dates


Question

I have data that looks like this (note dates are in DD/MM/YYYY format):

ID  date        drug  score
A   28/08/2016  2     3
A   29/08/2016  1     4
A   30/08/2016  2     4
A   2/09/2016   2     4
A   3/09/2016   1     4
A   4/09/2016   2     4
B   8/08/2016   1     3
B   9/08/2016   2     4
B   10/08/2016  2     3
B   11/08/2016  1     3
C   30/11/2016  2     4
C   2/12/2016   1     5
C   3/12/2016   2     1
C   5/12/2016   1     4
C   6/12/2016   2     4
C   8/12/2016   1     2
C   9/12/2016   1     2

For 'drug': 1=drug taken, 2=no drug taken.

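I need to summarise, for each ID:


  • 0day: the average score on days when the drug was taken.

  • -1day: the average score on the days before the drug was taken.

  • +1day: the average score on the days after the drug was taken.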

If a drug was taken 2 days in a row (eg the last 2 rows of the example) then these scores should not be counted in the -1day or +1day calculations (i.e., each of the last two rows would contribute to the 0day score but would not contribute to the other metrics).

So for this example data, I would need an output table like this:

    -1day   0day      +1day
A   3.5     4         4
B   3       3         4
C           3.25      2.5

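Note that there are not records for every date, and the -1day and +1day calculations need to be based on actual dates, not just on the neighbouring records in the dataset.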

I have no idea how to do this.

I also have two additional bonus questions:


  • I will most likely also need to calculate -2day and +2day scores, so I need to be able to adapt an answer to do that.

  • How could I calculate a NoDrug score, which is the mean of all days that are not within 5 days of a drug-taking day?

Here is code to generate a dataframe with this example data:

data<-data.frame(ID=c("A","A","A","A","A","A","B","B","B","B","C","C","C","C","C","C","C"),
                 date=as.Date(c("28/08/2016","29/08/2016","30/08/2016","2/09/2016","3/09/2016","4/09/2016","8/08/2016","9/08/2016","10/08/2016","11/08/2016","30/11/2016","2/12/2016","3/12/2016","5/12/2016","6/12/2016","8/12/2016","9/12/2016"),format= "%d/%m/%Y"),
                 drug=c(2,1,2,2,1,2,1,2,2,1,2,1,2,1,2,1,1),
                 score=c(3,4,4,4,4,4,3,4,3,3,4,5,1,4,4,2,2))


Recommended answer

You can use dplyr to get this:




df <- data.frame(
  ID=c("A","A","A","A","A","A","B","B","B","B","C","C","C","C","C","C","C"),
  date=as.Date(c("28/08/2016","29/08/2016","30/08/2016","2/09/2016","3/09/2016","4/09/2016","8/08/2016","9/08/2016","10/08/2016","11/08/2016","30/11/2016","2/12/2016","3/12/2016","5/12/2016","6/12/2016","8/12/2016","9/12/2016"),format= "%d/%m/%Y"),
  drug=c(2,1,2,2,1,2,1,2,2,1,2,1,2,1,2,1,1),
  score=c(3,4,4,4,4,4,3,4,3,3,4,5,1,4,4,2,2)
)

df

#>    ID       date drug score
#> 1   A 2016-08-28    2     3
#> 2   A 2016-08-29    1     4
#> 3   A 2016-08-30    2     4
#> 4   A 2016-09-02    2     4
#> 5   A 2016-09-03    1     4
#> 6   A 2016-09-04    2     4
#> 7   B 2016-08-08    1     3
#> 8   B 2016-08-09    2     4
#> 9   B 2016-08-10    2     3
#> 10  B 2016-08-11    1     3
#> 11  C 2016-11-30    2     4
#> 12  C 2016-12-02    1     5
#> 13  C 2016-12-03    2     1
#> 14  C 2016-12-05    1     4
#> 15  C 2016-12-06    2     4
#> 16  C 2016-12-08    1     2
#> 17  C 2016-12-09    1     2



Fill in missing rows (days)

A nice way to solve these sorts of problems is to make the implicitly missing observations explicitly missing, which is what tidyr::complete does.

library(dplyr)
library(tidyr)

df1 <- df %>% 
  group_by(ID) %>% 
  complete(date = seq(min(date), max(date), by = "day"))

df1

#> Source: local data frame [22 x 4]
#> Groups: ID [3]
#> 
#> # A tibble: 22 x 4
#>        ID       date  drug score
#>    <fctr>     <date> <dbl> <dbl>
#>  1      A 2016-08-28     2     3
#>  2      A 2016-08-29     1     4
#>  3      A 2016-08-30     2     4
#>  4      A 2016-08-31    NA    NA
#>  5      A 2016-09-01    NA    NA
#>  6      A 2016-09-02     2     4
#>  7      A 2016-09-03     1     4
#>  8      A 2016-09-04     2     4
#>  9      B 2016-08-08     1     3
#> 10      B 2016-08-09     2     4
#> # ... with 12 more rows
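
To see why the filled-in rows matter, here is a small sketch using only the first three records of subject C (the names c_only, naive and filled are made up for this illustration). lead() and lag() step through rows, not dates, so without complete() the 30 November record would be treated as the day before the 2 December drug day, even though 1 December lies in between.

```r
library(dplyr)
library(tidyr)

# First three records of subject C: 1 December has no record
c_only <- data.frame(
  ID    = "C",
  date  = as.Date(c("30/11/2016", "2/12/2016", "3/12/2016"), format = "%d/%m/%Y"),
  drug  = c(2, 1, 2),
  score = c(4, 5, 1)
)

# Without complete(): lead() looks at the next *row*, which is two days later
naive <- c_only %>%
  group_by(ID) %>%
  mutate(day_before = lead(drug) == 1 & drug != 1) %>%
  ungroup()

# With complete(): lead() looks at the next *calendar day*
filled <- c_only %>%
  group_by(ID) %>%
  complete(date = seq(min(date), max(date), by = "day")) %>%
  mutate(day_before = lead(drug) == 1 & drug != 1) %>%
  ungroup()

naive$day_before[naive$date == as.Date("2016-11-30")]    # TRUE (wrongly flagged)
filled$day_before[filled$date == as.Date("2016-11-30")]  # NA (not counted)
```

The NA in the filled version is harmless here because rows whose flag is not TRUE are dropped before the mean is taken.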




Categorize days

df2 <- df1 %>% 
  group_by(ID) %>% 
  mutate(day_of = drug == 1,
         day_before = (lead(drug) == 1 & day_of == FALSE),
         day_after = (lag(drug) == 1 & day_of == FALSE))

df2

#> Source: local data frame [22 x 7]
#> Groups: ID [3]
#> 
#> # A tibble: 22 x 7
#>        ID       date  drug score day_of day_before day_after
#>    <fctr>     <date> <dbl> <dbl>  <lgl>      <lgl>     <lgl>
#>  1      A 2016-08-28     2     3  FALSE       TRUE        NA
#>  2      A 2016-08-29     1     4   TRUE      FALSE     FALSE
#>  3      A 2016-08-30     2     4  FALSE         NA      TRUE
#>  4      A 2016-08-31    NA    NA     NA         NA     FALSE
#>  5      A 2016-09-01    NA    NA     NA      FALSE        NA
#>  6      A 2016-09-02     2     4  FALSE       TRUE        NA
#>  7      A 2016-09-03     1     4   TRUE      FALSE     FALSE
#>  8      A 2016-09-04     2     4  FALSE         NA      TRUE
#>  9      B 2016-08-08     1     3   TRUE      FALSE     FALSE
#> 10      B 2016-08-09     2     4  FALSE      FALSE      TRUE
#> # ... with 12 more rows
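
The two bonus questions can probably be handled with the same kind of day categories. The sketch below is only one possible reading of them, not part of the original answer: n_days, window and the helper days_to_nearest_drug() are hypothetical names, a day 2 days before or after a drug day is excluded only when it is itself a drug day, and "within 5 days" is taken to mean a distance of 5 days or fewer.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  ID = c("A","A","A","A","A","A","B","B","B","B","C","C","C","C","C","C","C"),
  date = as.Date(c("28/08/2016","29/08/2016","30/08/2016","2/09/2016","3/09/2016",
                   "4/09/2016","8/08/2016","9/08/2016","10/08/2016","11/08/2016",
                   "30/11/2016","2/12/2016","3/12/2016","5/12/2016","6/12/2016",
                   "8/12/2016","9/12/2016"), format = "%d/%m/%Y"),
  drug = c(2,1,2,2,1,2,1,2,2,1,2,1,2,1,2,1,1),
  score = c(3,4,4,4,4,4,3,4,3,3,4,5,1,4,4,2,2)
)

n_days <- 2  # assumed offset; 1 reproduces day_before / day_after
window <- 5  # assumed: a distance of 5 days or fewer counts as "within 5 days"

# Hypothetical helper: days from each date to the nearest drug day in the group
days_to_nearest_drug <- function(date, drug) {
  drug_days <- as.numeric(date[which(drug == 1)])
  vapply(as.numeric(date), function(d) min(abs(d - drug_days)), numeric(1))
}

df_bonus <- df %>%
  group_by(ID) %>%
  complete(date = seq(min(date), max(date), by = "day")) %>%
  mutate(day_of = drug == 1,
         day_before_n = lead(drug, n_days) == 1 & day_of == FALSE,
         day_after_n = lag(drug, n_days) == 1 & day_of == FALSE,
         days_to_drug = days_to_nearest_drug(date, drug),
         no_drug = days_to_drug > window) %>%
  ungroup()

# which() keeps only the TRUE flags, so NA and FALSE rows are ignored
bonus_summary <- df_bonus %>%
  group_by(ID) %>%
  summarise(before_n = mean(score[which(day_before_n)]),
            after_n = mean(score[which(day_after_n)]),
            NoDrug = mean(score[which(no_drug)], na.rm = TRUE))

bonus_summary
```

In this sample data every record lies within 5 days of a drug day, so the NoDrug column comes out as NaN for all three IDs; it only becomes informative with longer gaps between drug days.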



Summarise by day types

dplyr::mutate_at applies a function (in funs()) to all the columns selected in vars(). summarise_at works the same way on the selected columns, but instead of changing the values in the full dataset it reduces it to one row per group. You can read more about mutate, summarise, and the special *_at versions in the dplyr documentation.

df3 <- df2 %>% 
  mutate_at(vars(starts_with("day_")), funs(if_else(. == TRUE, score, NA_real_))) %>% 
  summarise_at(vars(starts_with("day_")), mean, na.rm = TRUE)

df3

#> # A tibble: 3 x 4
#>       ID day_of day_before day_after
#>   <fctr>  <dbl>      <dbl>     <dbl>
#> 1      A   4.00        3.5       4.0
#> 2      B   3.00        3.0       4.0
#> 3      C   3.25        NaN       2.5
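
On current dplyr versions the same summary can probably be written with across(), since funs() has been deprecated since dplyr 0.8.0 and the *_at verbs were superseded by across() in dplyr 1.0.0. The following is a sketch of that equivalent, not part of the original answer; it rebuilds the whole pipeline so it can be run on its own, and df3_across is a name chosen for this illustration.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  ID = c("A","A","A","A","A","A","B","B","B","B","C","C","C","C","C","C","C"),
  date = as.Date(c("28/08/2016","29/08/2016","30/08/2016","2/09/2016","3/09/2016",
                   "4/09/2016","8/08/2016","9/08/2016","10/08/2016","11/08/2016",
                   "30/11/2016","2/12/2016","3/12/2016","5/12/2016","6/12/2016",
                   "8/12/2016","9/12/2016"), format = "%d/%m/%Y"),
  drug = c(2,1,2,2,1,2,1,2,2,1,2,1,2,1,2,1,1),
  score = c(3,4,4,4,4,4,3,4,3,3,4,5,1,4,4,2,2)
)

df3_across <- df %>%
  group_by(ID) %>%
  # make missing calendar days explicit
  complete(date = seq(min(date), max(date), by = "day")) %>%
  # flag drug days and the non-drug days directly before / after them
  mutate(day_of = drug == 1,
         day_before = lead(drug) == 1 & day_of == FALSE,
         day_after = lag(drug) == 1 & day_of == FALSE) %>%
  # keep the score where a flag is TRUE, otherwise NA
  mutate(across(starts_with("day_"), ~ if_else(.x, score, NA_real_))) %>%
  # one row per ID: mean of the kept scores
  summarise(across(starts_with("day_"), ~ mean(.x, na.rm = TRUE)))

df3_across
```

This should give the same three rows as df3 above, with NaN for subject C's day_before because no recorded non-drug day falls directly before one of C's drug days.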
