R summarise_at dynamically by condition: mean for some columns, sum for others
Question
I've added the word "dynamically" to the title: when I use vars(c()) in summarise_at() it's for quick, clear examples, but in my real code I use contains(), starts_with() and matches(, , perl = TRUE), because I have 50 columns, most of which need sum() and some of which need mean().
The goal is to generate dynamic SQL with tbl() ... %>% group_by() ... %>% summarise_at() ... %>% collect().
My second example below also shows the SQL that is generated.
library(tidyverse)
(mtcars
%>% group_by(carb)
%>% summarise_at(vars(c("mpg","cyl","disp")), list (~mean(.),~sum(.)))
# I don't want the select() line below; I would like a conditional inside summarise_at(), because I have 50 columns in my real case
%>% select(carb,cyl_mean,disp_mean,mpg_sum)
)
#> # A tibble: 6 x 4
#> carb cyl_mean disp_mean mpg_sum
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 4.57 134. 177.
#> 2 2 5.6 208. 224
#> 3 3 8 276. 48.9
#> 4 4 7.2 309. 158.
#> 5 6 6 145 19.7
#> 6 8 8 301 15
Created on 2020-02-19 by the reprex package (v0.3.0)
This works, but I want only the sum for mpg and only the mean for cyl and disp.
library(RSQLite)
library(dbplyr)
library(tidyverse)
library(DBI)
db <- dbConnect(SQLite(),":memory:")
dbCreateTable(db, "mtcars_table", mtcars)
(tbl( db, build_sql( con=db,"select * from mtcars_table" ))
%>% group_by(carb)
%>% summarise_at(vars(c("mpg","cyl","disp")), list (~mean(.),~sum(.)))
%>% select(carb,cyl_mean,disp_mean,mpg_sum)
%>% show_query()
)
#> <SQL>
#> Warning: Missing values are always removed in SQL.[...] to silence this warning
#> SELECT `carb`, `cyl_mean`, `disp_mean`, `mpg_sum`
#> FROM (SELECT `carb`, AVG(`mpg`) AS `mpg_mean`, AVG(`cyl`) AS `cyl_mean`, AVG(`disp`) AS `disp_mean`, SUM(`mpg`) AS `mpg_sum`, SUM(`cyl`) AS `cyl_sum`, SUM(`disp`) AS `disp_sum`
#> FROM (select * from mtcars_table)
#> GROUP BY `carb`)
#> # Source: lazy query [?? x 4]
#> # Database: sqlite 3.30.1 [:memory:]
#> # … with 4 variables: carb <dbl>, cyl_mean <lgl>, disp_mean <lgl>,
#> # mpg_sum <lgl>
I tried several approaches along these lines, but they either don't work or raise errors.
Error:
(mtcars %>% group_by(carb)%>% summarise_at(vars(c("mpg","cyl","disp")),ifelse(vars(contains(names(.),"mpg")),list(sum(.)),list(mean(.)))) )
Not good, too many columns:
library(tidyverse)
(mtcars %>% group_by(carb)%>% summarise_at(vars(c("mpg","cyl","disp")),ifelse ((names(.)=="mpg"), list(~sum(.)) , list(~mean(.)))))
#> # A tibble: 6 x 34
#> carb mpg_sum cyl_sum disp_sum mpg_mean..2 cyl_mean..2 disp_mean..2
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 177. 32 940. 25.3 4.57 134.
#> 2 2 224 56 2082. 22.4 5.6 208.
#> 3 3 48.9 24 827. 16.3 8 276.
#> 4 4 158. 72 3088. 15.8 7.2 309.
#> 5 6 19.7 6 145 19.7 6 145
#> 6 8 15 8 301 15 8 301
#> # … with 27 more variables: mpg_mean..3 <dbl>, cyl_mean..3 <dbl>,
#> # disp_mean..3 <dbl>, mpg_mean..4 <dbl>, cyl_mean..4 <dbl>,
#> # disp_mean..4 <dbl>, mpg_mean..5 <dbl>, cyl_mean..5 <dbl>,
#> # disp_mean..5 <dbl>, mpg_mean..6 <dbl>, cyl_mean..6 <dbl>,
#> # disp_mean..6 <dbl>, mpg_mean..7 <dbl>, cyl_mean..7 <dbl>,
#> # disp_mean..7 <dbl>, mpg_mean..8 <dbl>, cyl_mean..8 <dbl>,
#> # disp_mean..8 <dbl>, mpg_mean..9 <dbl>, cyl_mean..9 <dbl>,
#> # disp_mean..9 <dbl>, mpg_mean..10 <dbl>, cyl_mean..10 <dbl>,
#> # disp_mean..10 <dbl>, mpg_mean..11 <dbl>, cyl_mean..11 <dbl>,
#> # disp_mean..11 <dbl>
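A possible reading of the 34-column result, offered as a sketch rather than a definitive diagnosis: inside the pipe, names(.) appears to refer to the whole piped data frame (all 11 columns of mtcars), not to the three columns selected by vars(). If so, ifelse() returns a list of 11 functions (one sum and ten means), and summarise_at() applies every one of them to each of the three columns. The base R snippet below only checks the length of that list; the name fns is introduced here for illustration.

```r
# names(mtcars) has 11 entries, so the test vector has length 11 and
# ifelse() recycles the one-element lists to that length.
fns <- ifelse(names(mtcars) == "mpg", list(~ sum(.)), list(~ mean(.)))

length(fns)  # 11: one sum (mpg is the first column) followed by ten means

# 3 selected columns x 11 functions + the grouping column 'carb'
3 * length(fns) + 1  # 34
```

This matches the shape of the output above: mpg_sum, cyl_sum and disp_sum first, then the mean columns suffixed ..2 through ..11.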
Some other attempts and remarks
I would like a conditional sum(.) or mean(.) depending on the name of the column in the summarise().
It would be good if it accepted more than just primitive functions.
In the end, it's for tbl() ... %>% group_by() ... %>% summarise_at() ... %>% collect(), to generate conditional SQL with AVG() and SUM().
MSSQL SQL functions like ~(convert(varchar()) work in mutate_at(), and similarly ~AVG() works in summarise_at(), but I arrive at the same point: a conditional summarise_at() that depends on the column names doesn't work.
:)
Recommended answer
One option is to group_by 'carb', then add the sum of 'mpg' as a second grouping variable, and then use summarise_at on the rest of the variables needed.
library(dplyr)
mtcars %>%
group_by(carb) %>%
group_by(mpg_sum = sum(mpg), .add = TRUE) %>%
summarise_at(vars(cyl, disp), list(mean = mean))
# A tibble: 6 x 4
# Groups: carb [6]
# carb mpg_sum cyl_mean disp_mean
# <dbl> <dbl> <dbl> <dbl>
#1 1 177. 4.57 134.
#2 2 224 5.6 208.
#3 3 48.9 8 276.
#4 4 158. 7.2 309.
#5 6 19.7 6 145
#6 8 15 8 301
Or, using the devel version of dplyr, this can be done in a single summarise by wrapping the block of columns in across, listing the single column by itself, and applying a different function to each.
mtcars %>%
group_by(carb) %>%
summarise(across(one_of(c("cyl", "disp")), list(mean = mean)),
mpg_sum = sum(mpg))
# A tibble: 6 x 4
# carb cyl_mean disp_mean mpg_sum
# <dbl> <dbl> <dbl> <dbl>
#1 1 4.57 134. 177.
#2 2 5.6 208. 224
#3 3 8 276. 48.9
#4 4 7.2 309. 158.
#5 6 6 145 19.7
#6 8 8 301 15
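The same pattern should extend to the selection helpers mentioned in the question, with one across() call per function. The sketch below assumes dplyr 1.0.0 or later (the release in which across() became available); the regular expression and prefix are stand-ins for whatever rules pick out the real 50 columns, and the name result is introduced here for illustration.

```r
library(dplyr)  # assumes dplyr >= 1.0.0, where across() is available

result <- mtcars %>%
  group_by(carb) %>%
  summarise(
    # block of columns that should be averaged, chosen by pattern
    across(matches("^(cyl|disp)$"), list(mean = mean)),
    # block of columns that should be summed, chosen by prefix
    across(starts_with("mpg"), list(sum = sum))
  )

names(result)  # "carb" "cyl_mean" "disp_mean" "mpg_sum"
```

One thing to watch for: a later across() in the same summarise() can also see columns created by an earlier one, so selections such as everything() may pick up the new *_mean columns. Anchored patterns like the ones above avoid that.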
NOTE: summarise_at/summarise_if/mutate_at/mutate_if/... etc. will be superseded by across used inside the default verbs (summarise/mutate/filter/...) in the upcoming releases.
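Since the question's end goal is generated SQL, here is a hedged sketch of the across pattern against a database table. It assumes a dbplyr version that translates across() (2.0.0 or later, as far as I know) and that RSQLite is installed. It uses dbWriteTable() so the table actually contains the rows; dbCreateTable() only creates an empty table with matching column types. The names con, query, sql_text and result are introduced here for illustration.

```r
library(dplyr)
library(dbplyr)
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars_table", mtcars)  # copies the data, not just the schema

query <- tbl(con, "mtcars_table") %>%
  group_by(carb) %>%
  summarise(
    # na.rm = TRUE silences the "missing values are always removed" warning
    across(matches("^(cyl|disp)$"), list(mean = ~ mean(.x, na.rm = TRUE))),
    across(starts_with("mpg"), list(sum = ~ sum(.x, na.rm = TRUE)))
  )

show_query(query)                            # prints the generated SQL
sql_text <- as.character(sql_render(query))  # same SQL as a string
result <- collect(query)                     # runs the query, returns a tibble

dbDisconnect(con)
```

If the translation works as expected, the SQL should contain AVG() only for cyl and disp and SUM() only for mpg, so no unwanted aggregates are computed and the trailing select() is unnecessary.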