Understanding the difference in results between dplyr group_by and tapply
I was expecting to see the same results from these two runs, but they are different. This makes me question whether I really understand how the dplyr code works (I have read pretty much everything I can find about dplyr in the package and online). Can anyone explain why the results are different, or how to obtain matching results?
library(dplyr)
# %>% is used here because the older %.% pipe is no longer available in current dplyr
x <- iris
x <- x %>%
  group_by(Species, Sepal.Width) %>%
  summarise(freq = n()) %>%
  summarise(mean_by_group = mean(Sepal.Width))
print(x)
x <- iris
x <- tapply(x$Sepal.Width, x$Species, mean)
print(x)
Update: I don't think this is the most efficient way to do it, but the following code gives a result that matches the tapply approach. Per Hadley's suggestion, I scrutinized the results line by line, and this is the best I could come up with using dplyr:
library(dplyr)
x <- iris
x <- x %>%
  group_by(Species, Sepal.Width) %>%
  summarise(freq = n()) %>%
  # weight each distinct width by how often it occurs within the species
  mutate(mean_by_group = sum(Sepal.Width * freq) / sum(freq))
print(x)
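To make the arithmetic behind this update concrete, here is a minimal base R sketch (no dplyr needed) for a single species. It assumes the built-in iris data; the variable names w, mean_all, mean_distinct and mean_weighted are made up for illustration. After grouping by both Species and Sepal.Width, each distinct width appears once per species, so an unweighted mean over those rows is a mean of distinct values, not of all observations.

```r
# Sepal widths for one species, using the built-in iris data
w <- iris$Sepal.Width[iris$Species == "setosa"]

# Mean over all rows: what tapply() reports for setosa
mean_all <- mean(w)

# Mean over the distinct widths only: what averaging the grouped
# (Species, Sepal.Width) rows without weights amounts to
mean_distinct <- mean(unique(w))

# Weighting each distinct width by its frequency recovers the full mean
freq <- table(w)
mean_weighted <- sum(as.numeric(names(freq)) * freq) / sum(freq)

print(c(all = mean_all, distinct = mean_distinct, weighted = mean_weighted))
```

The weighted value should agree with the plain mean, while the mean of distinct values generally will not, which is consistent with the mismatch described in the question.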
Update: for some reason I thought I had to group all variables I wanted to analyse, which is what sent things in the wrong direction. This is all I needed, which is closer to the examples in the package.
x <- iris %>%
  group_by(Species) %>%
  summarise(Sepal.Width = mean(Sepal.Width))
print(x)
Maybe this...
- dplyr:
require(dplyr)
iris %>% group_by(Species) %>% summarise(mean_width = mean(Sepal.Width))
# Source: local data frame [3 x 2]
#
# Species mean_width
# 1 setosa 3.428
# 2 versicolor 2.770
# 3 virginica 2.974
- tapply:
tapply(iris$Sepal.Width, iris$Species, mean)
# setosa versicolor virginica
# 3.428 2.770 2.974
NOTE: tapply() simplifies its output by default, whereas summarise() does not:
typeof(tapply(iris$Sepal.Width, iris$Species, mean, simplify=TRUE))
# [1] "double"
Otherwise it returns a list:
typeof(tapply(iris$Sepal.Width, iris$Species, mean, simplify=FALSE))
# [1] "list"
So to actually get the same type of output from tapply() you would need:
tbl_df(
data.frame(
mean_width = tapply( iris$Sepal.Width,
iris$Species,
mean )))
# Source: local data frame [3 x 1]
#
# mean_width
# setosa 3.428
# versicolor 2.770
# virginica 2.974
And this still isn't the same, because the species names (unique(iris$Species)) are stored here as an attribute (the row names) and not as a column of the data frame...
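One possible way to close that remaining gap, sketched in base R under the assumption that a plain data frame is acceptable (the names res and out are illustrative, not from the original answer), is to copy the names of the tapply() result into a column of their own:

```r
# Per-species means; tapply() returns a named 1-d array here
res <- tapply(iris$Sepal.Width, iris$Species, mean)

# Move the names into a regular column so the layout resembles
# the summarise() output: one row per species, two columns
out <- data.frame(
  Species    = names(res),
  mean_width = as.vector(res),
  row.names  = NULL
)
print(out)
```

Note that Species comes back as a character column under the R 4.0+ defaults rather than as a factor, so this is still only an approximation of what summarise() returns.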