How to summarize (with multiple outputs) a column in a dataset by groups efficiently?
Question
I often need to provide summary statistics (mean, sd, median, Q1, etc.) for a variable by group.
Currently I am using code like this:
data %>%
  group_by(region, group) %>%
  summarise(mean = mean(value),
            sd = sd(value),
            min = min(value),
            q1 = quantile(value, probs = 0.25),
            median = median(value),
            q3 = quantile(value, probs = 0.75),
            max = max(value))
I find myself repeating this pattern a lot. Is there a better way to get the same summary table? Thanks.
Recommended answer
As @January suggested in the comments, creating a function is a good idea. Here I used quos() to quote each argument passed through ... and spliced them into group_by() using the !!! operator. enquo() captures the value argument as a quosure, and the !! operator then unquotes it inside each summary function:
library(dplyr)
library(rlang)

summary_stats1 <- function(data, value, ...){
  value <- enquo(value)            # capture the column to summarise
  data %>%
    group_by(!!!quos(...)) %>%     # splice the grouping columns
    summarise(mean = mean(!!value),
              sd = sd(!!value),
              min = min(!!value),
              q1 = quantile(!!value, probs = 0.25),
              median = median(!!value),
              q3 = quantile(!!value, probs = 0.75),
              max = max(!!value))
}
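To make the quoting step more concrete, here is a minimal sketch of what enquo() and quos() capture before anything is evaluated. The helper show_captured is a hypothetical name used only for illustration; it is not part of the answer's functions, and it assumes an rlang version that provides as_label().

```r
library(rlang)

# Hypothetical helper (illustration only): returns the captured
# expressions as text instead of evaluating them against a data frame.
show_captured <- function(value, ...) {
  value  <- enquo(value)   # one quosure for the `value` argument
  groups <- quos(...)      # a list of quosures, one per argument in ...
  list(
    value  = as_label(value),
    groups = unname(vapply(groups, as_label, character(1)))
  )
}

captured <- show_captured(mpg, gear, am)
captured$value    # the label of the summarised column
captured$groups   # the labels of the grouping columns
```

Because the arguments are captured rather than evaluated, mpg, gear and am do not need to exist as objects when the helper is called.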
Alternatively, use group_by_at(), which accepts the vars() helper function, so ... can be passed to it directly:
summary_stats2 <- function(data, value, ...){
  value <- enquo(value)
  data %>%
    group_by_at(vars(...)) %>%
    summarise(mean = mean(!!value),
              sd = sd(!!value),
              min = min(!!value),
              q1 = quantile(!!value, probs = 0.25),
              median = median(!!value),
              q3 = quantile(!!value, probs = 0.75),
              max = max(!!value))
}
We can also use the {{ }} interpolation operator introduced in rlang 0.4.0, which combines the quote and unquote steps into one:
summary_stats3 <- function(data, value, ...){
  data %>%
    group_by_at(vars(...)) %>%
    summarise(mean = mean({{ value }}),
              sd = sd({{ value }}),
              min = min({{ value }}),
              q1 = quantile({{ value }}, probs = 0.25),
              median = median({{ value }}),
              q3 = quantile({{ value }}, probs = 0.75),
              max = max({{ value }}))
}
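As a small aside, the sketch below compares the two styles on a single statistic. The helpers mean_enquo and mean_curly are hypothetical names for illustration, and the example assumes dplyr with rlang 0.4.0 or later; it is meant to show that {{ value }} can stand in for the enquo() plus !! pair, not to replace the functions above.

```r
library(dplyr)

# Hypothetical helpers (illustration only).

# Older style: capture with enquo(), then unquote with !!
mean_enquo <- function(data, value) {
  value <- enquo(value)
  summarise(data, mean = mean(!!value))
}

# Newer style: {{ }} captures and unquotes in one step
mean_curly <- function(data, value) {
  summarise(data, mean = mean({{ value }}))
}

old_style <- mean_enquo(mtcars, mpg)
new_style <- mean_curly(mtcars, mpg)

# Both helpers should report the same mean of mpg
all.equal(old_style$mean, new_style$mean)
```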
Output:
> summary_stats1(mtcars, mpg, gear, am)
> summary_stats2(mtcars, mpg, gear, am)
> summary_stats3(mtcars, mpg, gear, am)
# A tibble: 4 x 9
# Groups:   gear [3]
   gear    am  mean    sd   min    q1 median    q3   max
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>
1     3     0  16.1  3.37  10.4  14.5   15.5  18.4  21.5
2     4     0  21.0  3.07  17.8  18.8   21    23.2  24.4
3     4     1  26.3  5.41  21    21.3   25.0  30.9  33.9
4     5     1  21.4  6.66  15    15.8   19.7  26    30.4
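On dplyr 1.0.0 or later, where group_by_at() is superseded, the same table could plausibly be built with across() and a named list of functions. This is a sketch of a different technique, not part of the original answer: summary_stats4 is a hypothetical name, and the .names and .groups arguments assume dplyr 1.0.0 or newer.

```r
library(dplyr)

# Hypothetical variant (not from the original answer); assumes dplyr >= 1.0.0.
summary_stats4 <- function(data, value, ...) {
  data %>%
    group_by(...) %>%                # grouping columns are forwarded as-is
    summarise(
      across(
        {{ value }},
        list(
          mean   = mean,
          sd     = sd,
          min    = min,
          q1     = function(x) quantile(x, probs = 0.25),
          median = median,
          q3     = function(x) quantile(x, probs = 0.75),
          max    = max
        ),
        .names = "{.fn}"             # name each column after its statistic
      ),
      .groups = "drop_last"          # keep the default grouping, without the message
    )
}

res <- summary_stats4(mtcars, mpg, gear, am)
res
```

Keeping the statistics in a named list means a statistic can be added or removed in one place instead of editing each summarise() argument.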