Can dplyr summarise over several variables without listing each one?


Question


dplyr is amazingly fast, but I wonder if I'm missing something: is it possible to summarise over several variables without listing each one? For example:

library(dplyr)
library(reshape2)

(df=dput(structure(list(sex = structure(c(1L, 1L, 2L, 2L), .Label = c("boy", 
"girl"), class = "factor"), age = c(52L, 58L, 40L, 62L), bmi = c(25L, 
23L, 30L, 26L), chol = c(187L, 220L, 190L, 204L)), .Names = c("sex", 
"age", "bmi", "chol"), row.names = c(NA, -4L), class = "data.frame")))

   sex age bmi chol
1  boy  52  25  187
2  boy  58  23  220
3 girl  40  30  190
4 girl  62  26  204

dg=group_by(df,sex)

With this small dataframe, it's easy to write

summarise(dg,mean(age),mean(bmi),mean(chol))

I also know that I could get what I want by melting, taking the means, and then casting back with dcast:

dm=melt(df, id.var='sex')
dmg=group_by(dm, sex, variable)
x=summarise(dmg, means=mean(value))
dcast(x, sex~variable)

But what if I have more than 20 variables and a very large number of rows? Is there anything similar to .SD in data.table that would let me take the means of all variables in the grouped data frame? Or is it possible to somehow use lapply on the grouped data frame?

Thanks for any help

Solution

The data.table idiom is lapply(.SD, mean), applied by group:

library(data.table)
DT <- data.table(df)
DT[, lapply(.SD, mean), by = sex]
#     sex age bmi  chol
# 1:  boy  55  24 203.5
# 2: girl  51  28 197.0

I'm not sure of a dplyr idiom for the same thing, but you can build the arguments to summarise programmatically, something like:

dg <- group_by(df, sex)
# the names of the columns you want to summarize
cols <- names(dg)[-1]
# the dots component of your call to summarise
dots <- sapply(cols ,function(x) substitute(mean(x), list(x=as.name(x))))
do.call(summarise, c(list(.data=dg), dots))
# Source: local data frame [2 x 4]

#    sex age bmi  chol
# 1  boy  55  24 203.5
# 2 girl  51  28 197.0
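
To make the substitute() step less opaque, here is a minimal base-R sketch of what each element of dots holds. The use of lapply() with setNames() instead of sapply(), and the ungrouped toy data frame, are illustrative choices rather than part of the answer above.

```r
# Build the unevaluated call mean(age) from the column name "age".
one_call <- substitute(mean(x), list(x = as.name("age")))
print(one_call)   # shows the call itself, not a number
class(one_call)   # a language object of class "call"

# Same idea for several columns. setNames() keeps the column names, so a
# later summarise() call would label its output columns age, bmi and chol.
cols <- c("age", "bmi", "chol")
dots <- setNames(
  lapply(cols, function(x) substitute(mean(x), list(x = as.name(x)))),
  cols
)

# Evaluating one stored call against a data frame gives the plain column mean.
toy <- data.frame(age  = c(52, 58, 40, 62),
                  bmi  = c(25, 23, 30, 26),
                  chol = c(187, 220, 190, 204))
age_mean <- eval(dots$age, toy)
```

Passing these calls through do.call() is what lets summarise() see mean(age), mean(bmi) and mean(chol) as if they had been typed by hand.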

Note that there is a GitHub issue, #178, about efficiently implementing the plyr idiom colwise in dplyr.
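
For readers on a newer dplyr, the following is a sketch that assumes dplyr 1.0.0 or later, which postdates this answer. In those versions across() can apply one function to every non-grouping column, so no column names need to be listed. The data frame is rebuilt here only so that the snippet is self-contained.

```r
library(dplyr)

# Same data as in the question, rebuilt with data.frame() for brevity.
df <- data.frame(
  sex  = factor(c("boy", "boy", "girl", "girl")),
  age  = c(52L, 58L, 40L, 62L),
  bmi  = c(25L, 23L, 30L, 26L),
  chol = c(187L, 220L, 190L, 204L)
)

# Inside across(), everything() selects all columns except the grouping
# column, so mean() is applied to age, bmi and chol within each sex.
res <- df %>%
  group_by(sex) %>%
  summarise(across(everything(), mean))

print(res)
```

If only some columns should be summarised, everything() can be swapped for another tidyselect helper such as where(is.numeric).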
