Apply a summarise condition to a range of columns when using dplyr group_by?
Suppose we want to group_by() and summarise a massive data.frame with very many columns, but there are some large groups of consecutive columns that will have the same summarise condition (e.g. max, mean, etc.).
Is there a way to avoid specifying the summarise condition for each and every column, and instead specify it for ranges of columns?
Example
Suppose we want to do this:
iris %>%
group_by(Species) %>%
summarise(max(Sepal.Length), mean(Sepal.Width), mean(Petal.Length), mean(Petal.Width))
Note, however, that three consecutive columns have the same summarise condition: mean(Sepal.Width), mean(Petal.Length), mean(Petal.Width).
Is there a way to use something like mean(Sepal.Width:Petal.Width) to specify the condition for the whole range of columns, and so avoid typing out the summarise condition for every column in between?
Note
The iris example above is small and manageable, with a range of only 3 consecutive columns, but the actual use case has on the order of hundreds.
Version 1.0.0 of dplyr (upcoming at the time of writing) adds an across() function that does what you are asking for.
Basic usage
across() has two primary arguments:
- The first argument, .cols, selects the columns you want to operate on. It uses tidy selection (like select()) so you can pick variables by position, name, and type.
- The second argument, .fns, is a function or list of functions to apply to each column. This can also be a purrr style formula (or list of formulas) like ~ .x / 2. (This argument is optional, and you can omit it if you just want to get the underlying data; you'll see that technique used in vignette("rowwise").)
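As a minimal sketch of the two arguments described above, the following selects the same iris columns by name range, by position, and by type. It assumes dplyr 1.0.0 or later is installed; the object names by_range, by_position and by_type are purely illustrative and not part of the original answer.

```r
library(dplyr, warn.conflicts = FALSE)

# .cols given as a range of column names; .fns given as a bare function
by_range <- iris %>%
  group_by(Species) %>%
  summarise(across(Sepal.Width:Petal.Width, mean))

# .cols given by position (in iris, columns 2 to 4 are the same three columns)
by_position <- iris %>%
  group_by(Species) %>%
  summarise(across(2:4, mean))

# .cols given by type; .fns given as a purrr-style formula
by_type <- iris %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
```

With no .names argument, each summary column keeps the name of the column it was computed from, so by_range and by_position should hold the same values.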
### Install development version on GitHub first
# install.packages("devtools")
# devtools::install_github("tidyverse/dplyr")
library(dplyr, warn.conflicts = FALSE)
Control how the names are created with the .names argument, which takes a glue spec:
iris %>%
  group_by(Species) %>%
  summarise(
    across(c(Sepal.Width:Petal.Width), ~ mean(.x, na.rm = TRUE), .names = "mean_{col}"),
    across(c(Sepal.Length), ~ max(.x, na.rm = TRUE), .names = "max_{col}")
  )
#> # A tibble: 3 x 5
#>   Species    mean_Sepal.Width mean_Petal.Leng~ mean_Petal.Width max_Sepal.Length
#> * <fct>                 <dbl>            <dbl>            <dbl>            <dbl>
#> 1 setosa                 3.43             1.46            0.246              5.8
#> 2 versicolor             2.77             4.26            1.33               7
#> 3 virginica              2.97             5.55            2.03               7.9
Using multiple functions
my_func <- list(
  mean = ~ mean(., na.rm = TRUE),
  max = ~ max(., na.rm = TRUE)
)
iris %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric), my_func, .names = "{fn}.{col}"))
#> # A tibble: 3 x 9
#>   Species    mean.Sepal.Length max.Sepal.Length mean.Sepal.Width max.Sepal.Width
#> * <fct>                  <dbl>            <dbl>            <dbl>           <dbl>
#> 1 setosa                  5.01              5.8             3.43             4.4
#> 2 versicolor              5.94              7               2.77             3.4
#> 3 virginica               6.59              7.9             2.97             3.8
#>   mean.Petal.Length max.Petal.Length mean.Petal.Width max.Petal.Width
#> *             <dbl>            <dbl>            <dbl>           <dbl>
#> 1              1.46              1.9            0.246             0.6
#> 2              4.26              5.1            1.33              1.8
#> 3              5.55              6.9            2.03              2.5
Created on 2020-03-06 by the reprex package (v0.3.0)
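To illustrate the "hundreds of columns" case from the question, here is a rough sketch on a made-up wide data frame. The data, the column names v1 to v200, the grouping column g, and the split into two ranges are all assumptions for illustration only. It uses the {.col} placeholder, which is the spelling documented in released versions of dplyr, whereas the reprex above was run on a development version and uses {col}.

```r
library(dplyr, warn.conflicts = FALSE)

set.seed(1)
# Hypothetical wide data: 200 numeric columns v1..v200 plus a grouping column g
wide <- as.data.frame(matrix(rnorm(30 * 200), nrow = 30))
names(wide) <- paste0("v", 1:200)
wide$g <- rep(c("a", "b", "c"), each = 10)

res <- wide %>%
  group_by(g) %>%
  summarise(
    across(v1:v50, max, .names = "max_{.col}"),     # one condition for the first range
    across(v51:v200, mean, .names = "mean_{.col}")  # another condition for the second range
  )

dim(res)
```

Each across() call covers a whole range with a single condition, so the summarise() call stays the same length however many columns fall inside each range.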