Find average by group over a time period and retrieve last date for same period
Problem description
Below is a reproducible data table with four columns:
- date
- category
- value1
- value2
As the title suggests, I'd like to calculate average values from value1 and value2 per category, and keep the last observed date for those categories in the resulting dataframe.
Here is the input:
# Libraries
library(dplyr)
library(data.table)
# Reproducible data table
set.seed(1234)
date <- seq(as.Date("2017-01-01"), by = "month", length.out = 10)
category <- (c('A','A','B','B','C','C','C','C','C', 'C'))
value1 <- sample(seq(from = 91, to = 100, by = 1))
value2 <- sample(seq(from = 51, to = 60, by = 1))
dt <- data.table(date, category, value1, value2)
print(dt)
date category value1 value2
1: 2017-01-01 A 92 57
2: 2017-02-01 A 96 55
3: 2017-03-01 B 95 53
4: 2017-04-01 B 98 60
5: 2017-05-01 C 99 52
6: 2017-06-01 C 94 59
7: 2017-07-01 C 91 56
8: 2017-08-01 C 97 51
9: 2017-09-01 C 100 58
10: 2017-10-01 C 93 54
This is what I have so far:
dt2 <- dt %>% group_by(category) %>% summarise_each(funs(mean))
print(dt2)
# A tibble: 3 x 4
category date value1 value2
<chr> <date> <dbl> <dbl>
1 A 2017-01-16 94.00000 56.0
2 B 2017-03-16 96.50000 56.5
3 C 2017-07-16 95.66667 55.0
This is the desired output:
category date value1 value2
1 A 2017-02-01 94.00000 56.0
2 B 2017-04-01 96.50000 56.5
3 C 2017-10-01 95.66667 55.0
Thank you for any suggestions!
Recommended answer
Here is the data.table approach. We can perform the calculations inside .() in the j argument, and set the grouping in the by argument.
dt[, .(date = last(date), value1 = mean(value1), value2 = mean(value2)), by = category]
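Note that last(date) returns the date from the last row of each group, which is the most recent date only when the rows are already in chronological order, as they are here. Below is a minimal sketch that uses max(date) instead, so the result does not depend on row order. It assumes the data is typed in from the printed table in the question, because the values drawn by sample() depend on the R version. The names dt_fixed, shuffled and res are made up for this illustration.

```r
library(data.table)

# Data copied from the printed table in the question (dt_fixed is a hypothetical name)
dt_fixed <- data.table(
  date     = seq(as.Date("2017-01-01"), by = "month", length.out = 10),
  category = c("A", "A", "B", "B", "C", "C", "C", "C", "C", "C"),
  value1   = c(92, 96, 95, 98, 99, 94, 91, 97, 100, 93),
  value2   = c(57, 55, 53, 60, 52, 59, 56, 51, 58, 54)
)

# Shuffle the rows to show that max(date) does not rely on row order
shuffled <- dt_fixed[sample(.N)]

# Latest date and column means per category, sorted by category afterwards
res <- shuffled[, .(date   = max(date),
                    value1 = mean(value1),
                    value2 = mean(value2)),
                by = category][order(category)]
print(res)
```

If the rows are known to be sorted by date within each category, last(date) and max(date) give the same answer.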
Here is a more efficient way, suggested by Frank in a comment on this post. With this approach the mean function only needs to be written once, using .SD and .SDcols to specify which columns are summarised by their mean.
dt[, c(.(date = last(date)), lapply(.SD, mean)), by = category, .SDcols = value1:value2]
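A range such as value1:value2 only works when the value columns sit next to each other. As a hedged sketch, the columns can instead be selected by name pattern. This assumes a data.table version that accepts patterns() in .SDcols (1.12.0 or later, as far as I know) and that every column to be averaged has a name starting with "value". The name dt_fixed is made up and holds the data typed in from the printed table; list() is used in place of its alias .().

```r
library(data.table)

# Data copied from the printed table in the question (dt_fixed is a hypothetical name)
dt_fixed <- data.table(
  date     = seq(as.Date("2017-01-01"), by = "month", length.out = 10),
  category = c("A", "A", "B", "B", "C", "C", "C", "C", "C", "C"),
  value1   = c(92, 96, 95, 98, 99, 94, 91, 97, 100, 93),
  value2   = c(57, 55, 53, 60, 52, 59, 56, 51, 58, 54)
)

# .SD contains only the columns matched by the pattern, so mean is written once;
# date is not part of .SD but can still be referenced directly in j
res <- dt_fixed[, c(list(date = last(date)), lapply(.SD, mean)),
                by = category,
                .SDcols = patterns("^value")]
print(res)
```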
And if you want to use dplyr, you can use Z.Lin's approach. However, if there are lots of value columns, such as value1 to value10, you can do the following.
dt %>%
  group_by(category) %>%
  summarise_all(funs(if_else(is.numeric(.), mean(.), last(.))))
This code calculates the mean only for numeric columns; for any other column it reports the last value in the group.
One final reminder: summarise_each has been deprecated. Please use summarise_all, summarise_if, or summarise_at.
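In more recent dplyr releases (1.0.0 and later), the scoped verbs summarise_all, summarise_if and summarise_at have themselves been superseded by across(), and funs() is deprecated. Below is a minimal sketch of the same summary in that style. It assumes dplyr >= 1.0.0 and value columns whose names start with "value". The name df_fixed is made up and holds the data typed in from the printed table.

```r
library(dplyr)

# Data copied from the printed table in the question (df_fixed is a hypothetical name)
df_fixed <- tibble(
  date     = seq(as.Date("2017-01-01"), by = "month", length.out = 10),
  category = c("A", "A", "B", "B", "C", "C", "C", "C", "C", "C"),
  value1   = c(92, 96, 95, 98, 99, 94, 91, 97, 100, 93),
  value2   = c(57, 55, 53, 60, 52, 59, 56, 51, 58, 54)
)

res <- df_fixed %>%
  group_by(category) %>%
  summarise(
    # dplyr::last is spelled out in case data.table is also attached
    date = dplyr::last(date),
    # apply mean to every column whose name starts with "value"
    across(starts_with("value"), mean)
  )
print(res)
```

Naming the date column explicitly avoids the type check inside if_else and makes it clear which columns are averaged.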