用dplyr总结多列? [英] Summarizing multiple columns with dplyr?
问题描述
我在 dplyr 语法上有点挣扎.我有一个包含不同变量和一个分组变量的数据框.现在我想使用 R 中的 dplyr 计算每组中每一列的平均值.
I'm struggling a bit with the dplyr-syntax. I have a data frame with different variables and one grouping variable. Now I want to calculate the mean for each column within each group, using dplyr in R.
df <- data.frame(
a = sample(1:5, n, replace = TRUE),
b = sample(1:5, n, replace = TRUE),
c = sample(1:5, n, replace = TRUE),
d = sample(1:5, n, replace = TRUE),
grp = sample(1:3, n, replace = TRUE)
)
df %>% group_by(grp) %>% summarise(mean(a))
这给了我grp"表示的每个组的a"列的平均值.
This gives me the mean for column "a" for each group indicated by "grp".
我的问题是:是否有可能一次获得每个组中每一列的平均值?或者我必须为每一列重复 df %>% group_by(grp) %>% summarise(mean(a))
吗?
My question is: is it possible to get the means for each column within each group at once? Or do I have to repeat df %>% group_by(grp) %>% summarise(mean(a))
for each column?
我想要的是类似的东西
df %>% group_by(grp) %>% summarise(mean(a:d)) # "mean(a:d)" does not work
推荐答案
在 dplyr
(>=1.00) 你可以在 across(everything()
中使用 across(everything()
>summarise 将函数应用于所有变量:
In dplyr
(>=1.00) you may use across(everything()
in summarise
to apply a function to all variables:
library(dplyr)
df %>% group_by(grp) %>% summarise(across(everything(), list(mean)))
#> # A tibble: 3 x 5
#> grp a b c d
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 3.08 2.98 2.98 2.91
#> 2 2 3.03 3.04 2.97 2.87
#> 3 3 2.85 2.95 2.95 3.06
或者,purrrlyr
包提供相同的功能:
Alternatively, the purrrlyr
package provides the same functionality:
library(purrrlyr)
df %>% slice_rows("grp") %>% dmap(mean)
#> # A tibble: 3 x 5
#> grp a b c d
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 3.08 2.98 2.98 2.91
#> 2 2 3.03 3.04 2.97 2.87
#> 3 3 2.85 2.95 2.95 3.06
也不要忘记 data.table
(使用 keyby
对排序组进行排序):
Also don't forget about data.table
(use keyby
to sort sort groups):
library(data.table)
setDT(df)[, lapply(.SD, mean), keyby = grp]
#> grp a b c d
#> 1: 1 3.079412 2.979412 2.979412 2.914706
#> 2: 2 3.029126 3.038835 2.967638 2.873786
#> 3: 3 2.854701 2.948718 2.951567 3.062678
让我们尝试比较性能.
library(dplyr)
library(purrrlyr)
library(data.table)
library(bench)
set.seed(123)
n <- 10000
df <- data.frame(
a = sample(1:5, n, replace = TRUE),
b = sample(1:5, n, replace = TRUE),
c = sample(1:5, n, replace = TRUE),
d = sample(1:5, n, replace = TRUE),
grp = sample(1:3, n, replace = TRUE)
)
dt <- setDT(df)
mark(
dplyr = df %>% group_by(grp) %>% summarise(across(everything(), list(mean))),
purrrlyr = df %>% slice_rows("grp") %>% dmap(mean),
data.table = dt[, lapply(.SD, mean), keyby = grp],
check = FALSE
)
#> # A tibble: 3 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 dplyr 2.81ms 2.85ms 328. NA 17.3
#> 2 purrrlyr 7.96ms 8.04ms 123. NA 24.5
#> 3 data.table 596.33µs 707.91µs 1409. NA 10.3
这篇关于用dplyr总结多列?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!