How to summarise data based on calculations on dates
Question
I have data that looks like this (note that dates are in DD/MM/YYYY format):
ID date drug score
A 28/08/2016 2 3
A 29/08/2016 1 4
A 30/08/2016 2 4
A 2/09/2016 2 4
A 3/09/2016 1 4
A 4/09/2016 2 4
B 8/08/2016 1 3
B 9/08/2016 2 4
B 10/08/2016 2 3
B 11/08/2016 1 3
C 30/11/2016 2 4
C 2/12/2016 1 5
C 3/12/2016 2 1
C 5/12/2016 1 4
C 6/12/2016 2 4
C 8/12/2016 1 2
C 9/12/2016 1 2
For 'drug': 1 = drug taken, 2 = no drug taken.
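I need to summarise the following for each ID:
- 0day: the mean score on days when the drug was taken.
- -1day: the mean score on the days before the drug was taken.
- +1day: the mean score on the days after the drug was taken.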
If a drug was taken 2 days in a row (e.g., the last 2 rows of the example), then these scores should not be counted in the -1day or +1day calculations (i.e., each of the last two rows would contribute to the 0day score but would not contribute to the other metrics).
So for this example data, I would need an output table like this:
ID  -1day  0day  +1day
A   3.5    4     4
B   3      3     4
C          3.25  2.5
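Note that there are not records for every date, so the -1day and +1day calculations need to be based on actual dates, not just on neighbouring records in the dataset.
I have no idea how to do this.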
I also have two additional bonus questions:
- I will most likely also need to calculate -2day and +2day scores, so I need to be able to adapt an answer to do that.
- How could I calculate a NoDrug score, which is the mean of all days that are not within 5 days of a drug-taking day?
Here is code to generate a data frame with this example data:
data<-data.frame(ID=c("A","A","A","A","A","A","B","B","B","B","C","C","C","C","C","C","C"),
date=as.Date(c("28/08/2016","29/08/2016","30/08/2016","2/09/2016","3/09/2016","4/09/2016","8/08/2016","9/08/2016","10/08/2016","11/08/2016","30/11/2016","2/12/2016","3/12/2016","5/12/2016","6/12/2016","8/12/2016","9/12/2016"),format= "%d/%m/%Y"),
drug=c(2,1,2,2,1,2,1,2,2,1,2,1,2,1,2,1,1),
score=c(3,4,4,4,4,4,3,4,3,3,4,5,1,4,4,2,2))
Recommended answer
You can use dplyr to get this:
df <- data.frame(
ID=c("A","A","A","A","A","A","B","B","B","B","C","C","C","C","C","C","C"),
date=as.Date(c("28/08/2016","29/08/2016","30/08/2016","2/09/2016","3/09/2016","4/09/2016","8/08/2016","9/08/2016","10/08/2016","11/08/2016","30/11/2016","2/12/2016","3/12/2016","5/12/2016","6/12/2016","8/12/2016","9/12/2016"),format= "%d/%m/%Y"),
drug=c(2,1,2,2,1,2,1,2,2,1,2,1,2,1,2,1,1),
score=c(3,4,4,4,4,4,3,4,3,3,4,5,1,4,4,2,2)
)
df
#> ID date drug score
#> 1 A 2016-08-28 2 3
#> 2 A 2016-08-29 1 4
#> 3 A 2016-08-30 2 4
#> 4 A 2016-09-02 2 4
#> 5 A 2016-09-03 1 4
#> 6 A 2016-09-04 2 4
#> 7 B 2016-08-08 1 3
#> 8 B 2016-08-09 2 4
#> 9 B 2016-08-10 2 3
#> 10 B 2016-08-11 1 3
#> 11 C 2016-11-30 2 4
#> 12 C 2016-12-02 1 5
#> 13 C 2016-12-03 2 1
#> 14 C 2016-12-05 1 4
#> 15 C 2016-12-06 2 4
#> 16 C 2016-12-08 1 2
#> 17 C 2016-12-09 1 2
Fill in missing rows (days)
A nice way to solve these sorts of problems is to make the implicitly missing observations explicitly missing with tidyr::complete:
library(dplyr)
library(tidyr)
df1 <- df %>%
group_by(ID) %>%
complete(date = seq(min(date), max(date), by = "day"))
df1
#> Source: local data frame [22 x 4]
#> Groups: ID [3]
#>
#> # A tibble: 22 x 4
#> ID date drug score
#> <fctr> <date> <dbl> <dbl>
#> 1 A 2016-08-28 2 3
#> 2 A 2016-08-29 1 4
#> 3 A 2016-08-30 2 4
#> 4 A 2016-08-31 NA NA
#> 5 A 2016-09-01 NA NA
#> 6 A 2016-09-02 2 4
#> 7 A 2016-09-03 1 4
#> 8 A 2016-09-04 2 4
#> 9 B 2016-08-08 1 3
#> 10 B 2016-08-09 2 4
#> # ... with 12 more rows
Categorize days
df2 <- df1 %>%
group_by(ID) %>%
mutate(day_of = drug == 1,
day_before = (lead(drug) == 1 & day_of == FALSE),
day_after = (lag(drug) == 1 & day_of == FALSE))
df2
#> Source: local data frame [22 x 7]
#> Groups: ID [3]
#>
#> # A tibble: 22 x 7
#> ID date drug score day_of day_before day_after
#> <fctr> <date> <dbl> <dbl> <lgl> <lgl> <lgl>
#> 1 A 2016-08-28 2 3 FALSE TRUE NA
#> 2 A 2016-08-29 1 4 TRUE FALSE FALSE
#> 3 A 2016-08-30 2 4 FALSE NA TRUE
#> 4 A 2016-08-31 NA NA NA NA FALSE
#> 5 A 2016-09-01 NA NA NA FALSE NA
#> 6 A 2016-09-02 2 4 FALSE TRUE NA
#> 7 A 2016-09-03 1 4 TRUE FALSE FALSE
#> 8 A 2016-09-04 2 4 FALSE NA TRUE
#> 9 B 2016-08-08 1 3 TRUE FALSE FALSE
#> 10 B 2016-08-09 2 4 FALSE FALSE TRUE
#> # ... with 12 more rows
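The same categorisation can be shifted by more than one day, which is one possible way to approach the -2day and +2day bonus question. The sketch below is not part of the original answer: `summarise_offset()` and its argument `k` are hypothetical names, and it assumes that a day counts towards the -k or +k score whenever it is not itself a drug day, regardless of what happened on the days in between. Because `complete()` leaves one row per calendar day, shifting by `k` rows with `lead()` or `lag()` is the same as shifting by `k` days.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  ID = rep(c("A", "B", "C"), times = c(6, 4, 7)),
  date = as.Date(c("28/08/2016", "29/08/2016", "30/08/2016", "2/09/2016",
                   "3/09/2016", "4/09/2016", "8/08/2016", "9/08/2016",
                   "10/08/2016", "11/08/2016", "30/11/2016", "2/12/2016",
                   "3/12/2016", "5/12/2016", "6/12/2016", "8/12/2016",
                   "9/12/2016"), format = "%d/%m/%Y"),
  drug = c(2, 1, 2, 2, 1, 2, 1, 2, 2, 1, 2, 1, 2, 1, 2, 1, 1),
  score = c(3, 4, 4, 4, 4, 4, 3, 4, 3, 3, 4, 5, 1, 4, 4, 2, 2)
)

# Hypothetical helper (not from the original answer): mean score k days
# before and k days after a drug day, per ID.
summarise_offset <- function(data, k = 1) {
  data %>%
    group_by(ID) %>%
    # One row per calendar day, so a shift of k rows is a shift of k days
    complete(date = seq(min(date), max(date), by = "day")) %>%
    arrange(date, .by_group = TRUE) %>%
    mutate(
      day_of   = drug == 1,
      before_k = lead(drug, n = k) == 1 & !day_of,
      after_k  = lag(drug, n = k) == 1 & !day_of
    ) %>%
    summarise(
      # which() drops NA flags; an empty selection gives NaN
      minus_k = mean(score[which(before_k)]),
      plus_k  = mean(score[which(after_k)])
    )
}

res1 <- summarise_offset(df, k = 1)
res2 <- summarise_offset(df, k = 2)
res2
```

With `k = 1` this should reproduce the -1day and +1day columns requested in the question. If days that fall next to another drug day should also be excluded for larger `k`, the two logical conditions would need an extra term.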
Summarise by day types
dplyr::mutate_at applies a function (in funs()) to all the columns selected in vars(). summarise_at works the same way on the selected columns, but instead of changing the values in the full dataset it reduces it to one row per group. You can read more about mutate, summarise, and the special *_at versions in the dplyr documentation.
df3 <- df2 %>%
mutate_at(vars(starts_with("day_")), funs(if_else(. == TRUE, score, NA_real_))) %>%
summarise_at(vars(starts_with("day_")), mean, na.rm = TRUE)
df3
#> # A tibble: 3 x 4
#> ID day_of day_before day_after
#> <fctr> <dbl> <dbl> <dbl>
#> 1 A 4.00 3.5 4.0
#> 2 B 3.00 3.0 4.0
#> 3 C 3.25 NaN 2.5
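In more recent dplyr releases, funs() is deprecated (since 0.8.0) and the *_at verbs are superseded by across() (since 1.0.0). The following is a sketch of an equivalent pipeline, assuming dplyr 1.0.0 or later; it is a rewrite for illustration, not the original answer's code. Instead of first overwriting the flag columns with scores, it indexes `score` directly with each logical flag.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  ID = rep(c("A", "B", "C"), times = c(6, 4, 7)),
  date = as.Date(c("28/08/2016", "29/08/2016", "30/08/2016", "2/09/2016",
                   "3/09/2016", "4/09/2016", "8/08/2016", "9/08/2016",
                   "10/08/2016", "11/08/2016", "30/11/2016", "2/12/2016",
                   "3/12/2016", "5/12/2016", "6/12/2016", "8/12/2016",
                   "9/12/2016"), format = "%d/%m/%Y"),
  drug = c(2, 1, 2, 2, 1, 2, 1, 2, 2, 1, 2, 1, 2, 1, 2, 1, 1),
  score = c(3, 4, 4, 4, 4, 4, 3, 4, 3, 3, 4, 5, 1, 4, 4, 2, 2)
)

df3_across <- df %>%
  group_by(ID) %>%
  # Make missing calendar days explicit
  complete(date = seq(min(date), max(date), by = "day")) %>%
  # Flag drug days and the non-drug days directly before/after them
  mutate(
    day_of     = drug == 1,
    day_before = lead(drug) == 1 & !day_of,
    day_after  = lag(drug) == 1 & !day_of
  ) %>%
  # For each flag column, average the scores of the rows where it is TRUE;
  # which() ignores NA flags, and an empty selection gives NaN
  summarise(across(starts_with("day_"), ~ mean(score[which(.x)])))

df3_across
```

The result should match the `df3` table above, including the NaN for ID C where no record falls on a day before a drug day.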
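The NoDrug bonus question is not covered by the answer above. One possible approach, sketched here under stated assumptions, is to compute each record's distance in days to the nearest drug day for the same ID and keep only the records that are further away than a threshold. The helper `days_to_nearest_drug()` and the argument `exclude_within` are hypothetical names, and "not within 5 days" is interpreted as a gap strictly greater than 5 days in either direction.

```r
library(dplyr)

df <- data.frame(
  ID = rep(c("A", "B", "C"), times = c(6, 4, 7)),
  date = as.Date(c("28/08/2016", "29/08/2016", "30/08/2016", "2/09/2016",
                   "3/09/2016", "4/09/2016", "8/08/2016", "9/08/2016",
                   "10/08/2016", "11/08/2016", "30/11/2016", "2/12/2016",
                   "3/12/2016", "5/12/2016", "6/12/2016", "8/12/2016",
                   "9/12/2016"), format = "%d/%m/%Y"),
  drug = c(2, 1, 2, 2, 1, 2, 1, 2, 2, 1, 2, 1, 2, 1, 2, 1, 1),
  score = c(3, 4, 4, 4, 4, 4, 3, 4, 3, 3, 4, 5, 1, 4, 4, 2, 2)
)

# Hypothetical helper: days from each record to the nearest drug day
# (0 for drug days, Inf if the group has no drug days at all).
days_to_nearest_drug <- function(date, drug) {
  d <- as.numeric(date)
  drug_days <- d[which(drug == 1)]
  if (length(drug_days) == 0) return(rep(Inf, length(d)))
  vapply(d, function(x) min(abs(x - drug_days)), numeric(1))
}

# Hypothetical helper: mean score of records more than `exclude_within`
# days away from every drug day, per ID.
no_drug_score <- function(data, exclude_within = 5) {
  data %>%
    group_by(ID) %>%
    mutate(gap = days_to_nearest_drug(date, drug)) %>%
    summarise(NoDrug = mean(score[gap > exclude_within]))
}

no_drug_5 <- no_drug_score(df, exclude_within = 5)
no_drug_0 <- no_drug_score(df, exclude_within = 0)
no_drug_0
```

This works on the actual dates, so it does not need the completed data frame. In the example data every record lies within 5 days of a drug day, so the 5-day version returns NaN for each ID; `exclude_within = 0` simply averages all non-drug days and is included only to show the function on this small sample.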