Apply a summarise condition to a range of columns when using dplyr group_by?

Problem description

Suppose we want to group_by() and summarise() a massive data.frame with very many columns, where large groups of consecutive columns share the same summarise condition (e.g. max, mean, etc.).

Is there a way to avoid having to specify the summarise condition for each and every column, and instead do it for ranges of columns?

Example

Suppose we want to do this:

iris %>% 
  group_by(Species) %>% 
  summarise(max(Sepal.Length), mean(Sepal.Width), mean(Petal.Length), mean(Petal.Width))

But note that three consecutive columns share the same summarise condition: mean(Sepal.Width), mean(Petal.Length), mean(Petal.Width).

Is there a way to use something like mean(Sepal.Width:Petal.Width) to specify the condition for a range of columns, and so avoid typing out the summarise condition for every column in between?

Note

The iris example above is small and manageable, with a range of only 3 consecutive columns, but the actual use case has hundreds.

Solution

dplyr 1.0.0 (upcoming when this answer was written, and since released on CRAN) provides the across() function, which does what you are asking for.

Basic usage

across() has two primary arguments:

  • The first argument, .cols, selects the columns you want to operate on. It uses tidy selection (like select()) so you can pick variables by position, name, and type.

  • The second argument, .fns, is a function or list of functions to apply to each column. This can also be a purrr-style formula (or list of formulas) like ~ .x / 2. (This argument is optional; you can omit it if you just want the underlying data, a technique used in vignette("rowwise").)
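
As a rough illustration of the .cols argument described above, the sketch below selects the same three iris columns by name range and by position, and all numeric columns by type. The object names (by_name, by_position, by_type) are made up for this example, and it assumes dplyr 1.0.0 or later is installed.

```r
library(dplyr, warn.conflicts = FALSE)

# Select by name range: every column from Sepal.Width through Petal.Width
by_name <- iris %>%
  group_by(Species) %>%
  summarise(across(Sepal.Width:Petal.Width, mean))

# Select by position: columns 2 to 4 of iris are the same three columns
by_position <- iris %>%
  group_by(Species) %>%
  summarise(across(2:4, mean))

# Select by type: every numeric column (the grouping column is left out)
by_type <- iris %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric), mean))
```

With a single function and no .names argument, the summarised columns keep their original names, so by_name and by_position should hold the same result.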

# Install the development version from GitHub first
# (only needed if dplyr 1.0.0 or later is not yet available from CRAN)
# install.packages("devtools")
# devtools::install_github("tidyverse/dplyr")
library(dplyr, warn.conflicts = FALSE)

Control how the output names are created with the .names argument, which takes a glue specification:

iris %>% 
  group_by(Species) %>% 
  summarise(
    across(c(Sepal.Width:Petal.Width), ~ mean(.x, na.rm = TRUE), .names = "mean_{.col}"),
    across(c(Sepal.Length), ~ max(.x, na.rm = TRUE), .names = "max_{.col}")
  )
#> # A tibble: 3 x 5
#>   Species    mean_Sepal.Width mean_Petal.Leng~ mean_Petal.Width max_Sepal.Length
#> * <fct>                 <dbl>            <dbl>            <dbl>            <dbl>
#> 1 setosa                 3.43             1.46            0.246              5.8
#> 2 versicolor             2.77             4.26            1.33               7  
#> 3 virginica              2.97             5.55            2.03               7.9

Using multiple functions

my_func <- list(
  mean = ~ mean(., na.rm = TRUE),
  max  = ~ max(., na.rm = TRUE)
)

iris %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric), my_func, .names = "{.fn}.{.col}"))
#> # A tibble: 3 x 9
#>   Species    mean.Sepal.Length max.Sepal.Length mean.Sepal.Width max.Sepal.Width
#> * <fct>                  <dbl>            <dbl>            <dbl>           <dbl>
#> 1 setosa                  5.01              5.8             3.43             4.4
#> 2 versicolor              5.94              7               2.77             3.4
#> 3 virginica               6.59              7.9             2.97             3.8
#>   mean.Petal.Length max.Petal.Length mean.Petal.Width max.Petal.Width
#> *             <dbl>            <dbl>            <dbl>           <dbl>
#> 1              1.46              1.9            0.246             0.6
#> 2              4.26              5.1            1.33              1.8
#> 3              5.55              6.9            2.03              2.5

Created on 2020-03-06 by the reprex package (v0.3.0)
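
For comparison with the multiple-function example above, here is a minimal sketch of what happens when .names is left out and a named list of plain functions is passed. The names my_funs and res are hypothetical, and the sketch assumes dplyr 1.0.0 or later, where the default naming scheme for a list of functions is "{.col}_{.fn}".

```r
library(dplyr, warn.conflicts = FALSE)

# Plain functions instead of formulas; passing extra arguments such as
# na.rm = TRUE would need the formula form used in the answer.
my_funs <- list(mean = mean, max = max)

res <- iris %>%
  group_by(Species) %>%
  summarise(across(Sepal.Length:Sepal.Width, my_funs))

# Column names follow the default "{.col}_{.fn}" pattern
names(res)
```

The list names (mean, max) become the suffixes of the output columns, which is why naming the list elements matters.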
