R summarise_at dynamically by condition: mean for some columns, sum for others


Question

I've added the word "dynamically" to the title: when I use vars(c()) in summarise_at(), it's only to keep the examples short and clear. In practice it is meant for contains(), starts_with() and matches(,, perl=TRUE), because I have 50 columns, many of which need sum() and some of which need mean().

The goal is to generate dynamic SQL with tbl() ... %>% group_by() ... %>% summarise_at() ... %>% collect().

I added an example of the generated SQL in my second example.



library(tidyverse)
(mtcars 
  %>% group_by(carb)
  %>% summarise_at(vars(c("mpg","cyl","disp")), list (~mean(.),~sum(.)))
  # I don't want the select() below; I would like a conditional inside summarise_at(), because my real case has 50 columns
  %>% select(carb,cyl_mean,disp_mean,mpg_sum)
)
#> # A tibble: 6 x 4
#>    carb cyl_mean disp_mean mpg_sum
#>   <dbl>    <dbl>     <dbl>   <dbl>
#> 1     1     4.57      134.   177. 
#> 2     2     5.6       208.   224  
#> 3     3     8         276.    48.9
#> 4     4     7.2       309.   158. 
#> 5     6     6         145     19.7
#> 6     8     8         301     15

Created on 2020-02-19 by the reprex package (v0.3.0)


This works, but I want only the sum for mpg and only the mean for cyl and disp.


library(RSQLite)
library(dbplyr)
library(tidyverse)
library(DBI)

db <- dbConnect(SQLite(),":memory:")

dbCreateTable(db, "mtcars_table", mtcars)

(tbl( db, build_sql( con=db,"select * from mtcars_table" ))
  %>% group_by(carb)
  %>% summarise_at(vars(c("mpg","cyl","disp")), list (~mean(.),~sum(.)))
  %>% select(carb,cyl_mean,disp_mean,mpg_sum)                   
  %>% show_query()
)
#> <SQL>
#> Warning: Missing values are always removed in SQL.[...]  to silence this warning
#> SELECT `carb`, `cyl_mean`, `disp_mean`, `mpg_sum`
#> FROM (SELECT `carb`, AVG(`mpg`) AS `mpg_mean`, AVG(`cyl`) AS `cyl_mean`, AVG(`disp`) AS `disp_mean`, SUM(`mpg`) AS `mpg_sum`, SUM(`cyl`) AS `cyl_sum`, SUM(`disp`) AS `disp_sum`
#> FROM (select * from mtcars_table)
#> GROUP BY `carb`)
#> # Source:   lazy query [?? x 4]
#> # Database: sqlite 3.30.1 [:memory:]
#> # … with 4 variables: carb <dbl>, cyl_mean <lgl>, disp_mean <lgl>,
#> #   mpg_sum <lgl>


I tried all sorts of things like the following, but they either don't work or produce errors.


Error


(mtcars %>% group_by(carb)%>% summarise_at(vars(c("mpg","cyl","disp")),ifelse(vars(contains(names(.),"mpg")),list(sum(.)),list(mean(.)))) )


Not good, too many columns


library(tidyverse)
(mtcars %>% group_by(carb)%>% summarise_at(vars(c("mpg","cyl","disp")),ifelse ((names(.)=="mpg"), list(~sum(.)) , list(~mean(.)))))
#> # A tibble: 6 x 34
#>    carb mpg_sum cyl_sum disp_sum mpg_mean..2 cyl_mean..2 disp_mean..2
#>   <dbl>   <dbl>   <dbl>    <dbl>       <dbl>       <dbl>        <dbl>
#> 1     1   177.       32     940.        25.3        4.57         134.
#> 2     2   224        56    2082.        22.4        5.6          208.
#> 3     3    48.9      24     827.        16.3        8            276.
#> 4     4   158.       72    3088.        15.8        7.2          309.
#> 5     6    19.7       6     145         19.7        6            145 
#> 6     8    15         8     301         15          8            301 
#> # … with 27 more variables: mpg_mean..3 <dbl>, cyl_mean..3 <dbl>,
#> #   disp_mean..3 <dbl>, mpg_mean..4 <dbl>, cyl_mean..4 <dbl>,
#> #   disp_mean..4 <dbl>, mpg_mean..5 <dbl>, cyl_mean..5 <dbl>,
#> #   disp_mean..5 <dbl>, mpg_mean..6 <dbl>, cyl_mean..6 <dbl>,
#> #   disp_mean..6 <dbl>, mpg_mean..7 <dbl>, cyl_mean..7 <dbl>,
#> #   disp_mean..7 <dbl>, mpg_mean..8 <dbl>, cyl_mean..8 <dbl>,
#> #   disp_mean..8 <dbl>, mpg_mean..9 <dbl>, cyl_mean..9 <dbl>,
#> #   disp_mean..9 <dbl>, mpg_mean..10 <dbl>, cyl_mean..10 <dbl>,
#> #   disp_mean..10 <dbl>, mpg_mean..11 <dbl>, cyl_mean..11 <dbl>,
#> #   disp_mean..11 <dbl>


Some other tries and remarks

I would like a conditional sum(.) or mean(.) depending on the name of the column in the summarise().

It would be good if it accepted more than just primitive functions.

In the end, this is for tbl() ... %>% group_by() ... %>% summarise_at() ... %>% collect(), to generate conditional SQL with AVG() and SUM().

MSSQL SQL functions like ~(convert(varchar()) work in mutate_at(), and similarly ~AVG() works in summarise_at(), but I arrive at the same point: a summarise_at() that is conditional on the column names doesn't work.

:)

Recommended answer

One option is to group_by 'carb', then add the sum of 'mpg' as a second grouping variable, and then use summarise_at on the remaining variables that are needed. Because sum(mpg) is computed within each 'carb' group, it is constant inside the group and so does not split the groups any further.

library(dplyr)
mtcars %>%
    group_by(carb) %>%
    group_by(mpg_sum = sum(mpg), .add = TRUE) %>%
    summarise_at(vars(cyl, disp), list(mean = mean))
# A tibble: 6 x 4
# Groups:   carb [6]
#   carb mpg_sum cyl_mean disp_mean
#  <dbl>   <dbl>    <dbl>     <dbl>
#1     1   177.      4.57      134.
#2     2   224       5.6       208.
#3     3    48.9     8         276.
#4     4   158.      7.2       309.
#5     6    19.7     6         145 
#6     8    15       8         301 






Or, using the devel version of dplyr (across() has since been released in dplyr 1.0.0), this can be done in a single summarise by wrapping the block of columns in across, listing the single column on its own, and applying a different function to each

mtcars %>%
  group_by(carb) %>% 
  summarise(across(one_of(c("cyl", "disp")), list(mean = mean)), 
            mpg_sum = sum(mpg))
# A tibble: 6 x 4
#   carb cyl_mean disp_mean mpg_sum
#  <dbl>    <dbl>     <dbl>   <dbl>
#1     1     4.57      134.   177. 
#2     2     5.6       208.   224  
#3     3     8         276.    48.9
#4     4     7.2       309.   158. 
#5     6     6         145     19.7
#6     8     8         301     15  
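For the dynamic case in the question, where columns are picked by name pattern rather than listed one by one, the same idea can probably be extended by using one across() call per function. The following is a minimal sketch, assuming dplyr 1.0.0 or later; the patterns passed to starts_with() and matches() are illustrative assumptions and would be replaced by whatever selects the 50 real columns.

```r
library(dplyr)

res <- mtcars %>%
  group_by(carb) %>%
  summarise(
    # columns whose name starts with "mpg" are summed -> mpg_sum
    across(starts_with("mpg"), list(sum = sum)),
    # columns matching this (illustrative) regex are averaged -> cyl_mean, disp_mean
    across(matches("^(cyl|disp)$"), list(mean = mean))
  )

res
```

One thing to watch: inside summarise(), a later across() can see columns created by an earlier one, so the patterns should be written so that they do not also match the new names (here "mpg_sum" does not match "^(cyl|disp)$").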

NOTE: summarise_at/summarise_if/mutate_at/mutate_if/... etc. will be superseded by across, used inside the default verbs (summarise/mutate/filter/...), in the upcoming releases
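If the choice between sum() and mean() really has to depend on each column's name, another possible route is to build the summary expressions programmatically and splice them into summarise(). This is only a sketch and not part of the original answer: the helper `fn_for` and its rule ("names containing mpg are summed, everything else is averaged") are hypothetical, and rlang's call2(), sym() and the `!!!` operator are used to construct the calls.

```r
library(dplyr)
library(rlang)

cols <- c("mpg", "cyl", "disp")

# Hypothetical rule: pick the function name from the column name
fn_for <- function(col) if (grepl("mpg", col, fixed = TRUE)) "sum" else "mean"

# Build one call per column, e.g. sum(mpg, na.rm = TRUE) or mean(cyl, na.rm = TRUE)
exprs <- lapply(cols, function(col) call2(fn_for(col), sym(col), na.rm = TRUE))

# Name each result <column>_<function>, e.g. mpg_sum, cyl_mean
names(exprs) <- paste0(cols, "_", vapply(cols, fn_for, character(1)))

res2 <- mtcars %>%
  group_by(carb) %>%
  summarise(!!!exprs)

res2
```

Because the spliced expressions are plain sum() and mean() calls, they are the kind of expression dbplyr is expected to translate to SUM() and AVG() when the same pipeline starts from tbl(), although that has not been checked here against a database.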
