Understanding the difference in results between dplyr group_by and tapply


Problem description


I was expecting to see the same results from these two runs, but they are different. This makes me question whether I really understand how the dplyr code works (I have read pretty much everything I can find about dplyr in the package and online). Can anyone explain why the results are different, or how to obtain similar results?

library(dplyr)

# Early dplyr releases spelled the pipe %.%; current versions use %>%.
x <- iris
x <- x %>%
  group_by(Species, Sepal.Width) %>%
  summarise(freq = n()) %>%
  summarise(mean_by_group = mean(Sepal.Width))
print(x)

x <- iris
x <- tapply(x$Sepal.Width, x$Species, mean)
print(x)

Update: I don't think this is the most efficient way to do it, but the following code gives a result that matches the tapply approach. Per Hadley's suggestion, I scrutinized the results line by line, and this is the best I could come up with using dplyr:

    library(dplyr)
    x <- iris
    x <- x %>%
      group_by(Species, Sepal.Width) %>%
      summarise(freq = n()) %>%
      mutate(mean_by_group = sum(Sepal.Width * freq) / sum(freq))
    print(x)

Update: for some reason I thought I had to group all variables I wanted to analyse, which is what sent things in the wrong direction. This is all I needed, which is closer to the examples in the package.

 x <- iris %>%
   group_by(Species) %>%
   summarise(Sepal.Width = mean(Sepal.Width))
 print(x)
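
Why the first attempt disagreed with tapply can be checked without dplyr at all. Grouping by both Species and Sepal.Width collapses the data to one row per distinct width, so a plain mean over those rows appears to ignore how often each width occurs. The sketch below reproduces that in base R; the names `counts`, `unweighted`, `weighted` and `plain` are illustrative and not part of the original post.

```r
# One row per (Species, Sepal.Width) pair that actually occurs, with its count.
counts <- as.data.frame(table(Species = iris$Species,
                              Sepal.Width = iris$Sepal.Width),
                        stringsAsFactors = FALSE)
counts <- counts[counts$Freq > 0, ]
counts$Sepal.Width <- as.numeric(counts$Sepal.Width)

# Mean of the distinct widths per species: every width counts once.
unweighted <- tapply(counts$Sepal.Width, counts$Species, mean)

# Mean weighted by frequency: every observation counts once.
weighted <- sapply(split(counts, counts$Species),
                   function(d) sum(d$Sepal.Width * d$Freq) / sum(d$Freq))

# Reference value: the per-species mean over the raw data.
plain <- tapply(iris$Sepal.Width, iris$Species, mean)

print(rbind(unweighted, weighted, plain))
```

Only the weighted row should agree with the tapply result, which is consistent with the frequency-weighted mutate() in the first update.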

Solution

Maybe this...

- dplyr:

require(dplyr)

iris %>% group_by(Species) %>% summarise(mean_width = mean(Sepal.Width))

  # Source: local data frame [3 x 2]
  #
  #      Species        mean_width
  # 1     setosa             3.428
  # 2 versicolor             2.770
  # 3  virginica             2.974

- tapply:

tapply(iris$Sepal.Width, iris$Species, mean)

  # setosa versicolor  virginica 
  # 3.428      2.770      2.974 


NOTE: tapply() simplifies output by default whereas summarise() does not:

typeof(tapply(iris$Sepal.Width, iris$Species, mean, simplify=TRUE))

  # [1] "double"

Otherwise it returns a list:

typeof(tapply(iris$Sepal.Width, iris$Species, mean, simplify=FALSE))

  # [1] "list"
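
As a small sketch of what that difference means in practice, the two forms hold the same numbers and differ only in their container. The variable names here are illustrative.

```r
# simplify = TRUE is the default, so the first call relies on it implicitly.
simplified   <- tapply(iris$Sepal.Width, iris$Species, mean)
unsimplified <- tapply(iris$Sepal.Width, iris$Species, mean, simplify = FALSE)

is.numeric(simplified)   # numeric array, one value per species
is.list(unsimplified)    # list array, one element per species

# Flattening the list form recovers the same values.
flattened <- unlist(unsimplified)
print(all.equal(as.numeric(flattened), as.numeric(simplified)))
```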

So to actually get the same type of output from tapply() you would need:

tbl_df( 
  data.frame( 
    mean_width = tapply( iris$Sepal.Width, 
                         iris$Species, 
                         mean )))

  # Source: local data frame [3 x 1]
  #
  #            mean_width
  # setosa          3.428
  # versicolor      2.770
  # virginica       2.974

And this still isn't the same, because unique(iris$Species) is stored as an attribute (the row names) here and not as a column of the data frame.
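
One way to close that remaining gap, sketched here in base R as an illustration rather than as part of the original answer, is to move the names of the tapply() result into a column of their own. The name `by_species` is hypothetical.

```r
means <- tapply(iris$Sepal.Width, iris$Species, mean)

# Promote the species names from an attribute to a real column.
by_species <- data.frame(
  Species    = factor(names(means), levels = levels(iris$Species)),
  mean_width = as.numeric(means),
  row.names  = NULL
)

print(by_species)
```

The result has the same two columns as the summarise() output, although it is a plain data.frame rather than a dplyr table.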
