Can dplyr summarise over several variables without listing each one?
Problem description
dplyr is amazingly fast, but I wonder if I'm missing something: is it possible to summarise over several variables without listing each one? For example:
library(dplyr)
library(reshape2)
(df=dput(structure(list(sex = structure(c(1L, 1L, 2L, 2L), .Label = c("boy",
"girl"), class = "factor"), age = c(52L, 58L, 40L, 62L), bmi = c(25L,
23L, 30L, 26L), chol = c(187L, 220L, 190L, 204L)), .Names = c("sex",
"age", "bmi", "chol"), row.names = c(NA, -4L), class = "data.frame")))
#    sex age bmi chol
# 1  boy  52  25  187
# 2  boy  58  23  220
# 3 girl  40  30  190
# 4 girl  62  26  204
dg=group_by(df,sex)
With this small dataframe, it's easy to write
summarise(dg,mean(age),mean(bmi),mean(chol))
I also know that, to get what I want, I could melt, take the means, and then dcast, like this:
dm=melt(df, id.var='sex')
dmg=group_by(dm, sex, variable)
x=summarise(dmg, means=mean(value))
dcast(x, sex~variable)
But what if I have more than 20 variables and a very large number of rows? Is there anything similar to .SD in data.table that would let me take the means of all variables in the grouped data frame? Or is it possible to somehow use lapply on the grouped data frame?
Thanks for any help
The data.table idiom is lapply(.SD, mean), which is
library(data.table)
DT <- data.table(df)
DT[, lapply(.SD, mean), by = sex]
# sex age bmi chol
# 1: boy 55 24 203.5
# 2: girl 51 28 197.0
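If only some of the columns should be averaged, data.table's .SDcols argument restricts what .SD contains. The following is a minimal sketch, not part of the original answer: it rebuilds the question's toy data so it runs on its own, and the choice of age and chol is purely illustrative.

```r
library(data.table)

# Same toy data as in the question, rebuilt so this block is self-contained.
df <- data.frame(
  sex  = factor(c("boy", "boy", "girl", "girl")),
  age  = c(52L, 58L, 40L, 62L),
  bmi  = c(25L, 23L, 30L, 26L),
  chol = c(187L, 220L, 190L, 204L)
)

DT <- data.table(df)

# .SDcols limits .SD to the named columns (an illustrative subset here);
# the grouping column is supplied separately through `by`.
res_dt <- DT[, lapply(.SD, mean), by = sex, .SDcols = c("age", "chol")]
res_dt
```

With more than 20 variables, the vector passed to .SDcols can be built programmatically, for example with setdiff(names(DT), "sex").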
I'm not sure of a dplyr idiom for the same thing, but you can do something like
dg <- group_by(df, sex)
# the names of the columns you want to summarize
cols <- names(dg)[-1]
# the dots component of your call to summarise
dots <- sapply(cols, function(x) substitute(mean(x), list(x = as.name(x))))
do.call(summarise, c(list(.data=dg), dots))
# Source: local data frame [2 x 4]
# sex age bmi chol
# 1 boy 55 24 203.5
# 2 girl 51 28 197.0
Note that there is a GitHub issue (#178) about efficiently implementing the plyr idiom colwise in dplyr.
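For readers on a newer dplyr, the same result can be had without building the calls by hand. The sketch below assumes dplyr 1.0.0 or later, where across() is available; that version is an assumption about the installed package, not something stated in the answer above. The toy data is rebuilt so the block runs on its own.

```r
library(dplyr)

# Same toy data as in the question, rebuilt so this block is self-contained.
df <- data.frame(
  sex  = factor(c("boy", "boy", "girl", "girl")),
  age  = c(52L, 58L, 40L, 62L),
  bmi  = c(25L, 23L, 30L, 26L),
  chol = c(187L, 220L, 190L, 204L)
)

dg <- group_by(df, sex)

# Inside summarise(), across(everything(), mean) applies mean() to every
# column except the grouping column, so nothing has to be listed by name.
res <- summarise(dg, across(everything(), mean))
res
```

If the data frame also holds non-numeric columns, across(where(is.numeric), mean) should restrict the summary to the numeric ones.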