Understanding the difference in results between dplyr group_by and tapply

Published: 2017/7/13 21:55:29. Tags: group-by, dplyr, tapply


Problem description


I was expecting to see the same results from these two runs, but they are different. This makes me question whether I really understand how the dplyr code works (I have read pretty much everything I can find about dplyr in the package and online). Can anyone explain why the results are different, or how to obtain similar results?

    library(dplyr)

    # The original post chained with %.%, the operator that dplyr later replaced with %>%
    x <- iris
    x <- x %>%
      group_by(Species, Sepal.Width) %>%
      summarise(freq = n()) %>%
      summarise(mean_by_group = mean(Sepal.Width))
    print(x)

    x <- iris
    x <- tapply(x$Sepal.Width, x$Species, mean)
    print(x)

Update: I don't think this is the most efficient way to do it, but the following code gives a result that matches the tapply approach. Per Hadley's suggestion, I scrutinized the results line by line, and this is the best I could come up with using dplyr:

    library(dplyr)

    x <- iris
    x <- x %>%
      group_by(Species, Sepal.Width) %>%
      summarise(freq = n()) %>%
      # Weight each distinct width by its row count to recover the per-species mean
      mutate(mean_by_group = sum(Sepal.Width * freq) / sum(freq))
    print(x)

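A base R sketch of why the first attempt differed may help here. It assumes that summarising the (Species, Sepal.Width) groups leaves one row per distinct width within each species, so a plain mean over those rows ignores how often each width occurs. The names `counts`, `unweighted`, `weighted`, and `target` are illustrative and do not come from the original post.

```r
# One row per distinct (Species, Sepal.Width) pair with its row count,
# mimicking group_by(Species, Sepal.Width) followed by summarise(freq = n()).
counts <- aggregate(freq ~ Species + Sepal.Width,
                    data = transform(iris, freq = 1),
                    FUN = sum)

# Plain mean over the distinct widths: every width counts once.
unweighted <- tapply(counts$Sepal.Width, counts$Species, mean)

# Mean weighted by frequency: equivalent to averaging the raw rows.
weighted <- sapply(split(counts, counts$Species),
                   function(d) sum(d$Sepal.Width * d$freq) / sum(d$freq))

# Reference values computed directly from the raw data.
target <- tapply(iris$Sepal.Width, iris$Species, mean)

print(unweighted)
print(weighted)
print(target)
```

Only the weighted version should agree with `target`, which is consistent with the mutate-based fix in the update.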
Update: for some reason I thought I had to group by every variable I wanted to analyse, which is what sent things in the wrong direction. This is all I needed, and it is closer to the examples in the package.

    x <- iris %>%
      group_by(Species) %>%
      summarise(Sepal.Width = mean(Sepal.Width))
    print(x)

Solution

Maybe this...

- dplyr:

    require(dplyr)

    iris %>%
      group_by(Species) %>%
      summarise(mean_width = mean(Sepal.Width))

    # Source: local data frame [3 x 2]
    #
    #      Species mean_width
    # 1     setosa      3.428
    # 2 versicolor      2.770
    # 3  virginica      2.974

- tapply:

    tapply(iris$Sepal.Width, iris$Species, mean)

    #     setosa versicolor  virginica
    #      3.428      2.770      2.974

NOTE: tapply() simplifies output by default whereas summarise() does not:

    typeof(tapply(iris$Sepal.Width, iris$Species, mean, simplify = TRUE))
    # [1] "double"

It returns a list otherwise:

    typeof(tapply(iris$Sepal.Width, iris$Species, mean, simplify = FALSE))
    # [1] "list"

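As a small illustration of the simplify argument, the sketch below compares both results side by side. The variable names are illustrative. The point is that both forms hold the same three means, just in a different container.

```r
simplified <- tapply(iris$Sepal.Width, iris$Species, mean)
unsimplified <- tapply(iris$Sepal.Width, iris$Species, mean, simplify = FALSE)

# The default result is a numeric array with one named entry per species.
print(is.numeric(simplified))
print(names(simplified))

# Without simplification each entry is a list element holding one mean.
print(is.list(unsimplified))
print(unsimplified[[1]])  # mean for the first species level

# Flattening the list gives back the same values as the simplified form.
same_values <- all.equal(unlist(unsimplified), simplified,
                         check.attributes = FALSE)
print(same_values)
```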
So to actually get the same type of output from tapply() you would need:

    # Note: tbl_df() has since been deprecated in dplyr
    tbl_df(
      data.frame(
        mean_width = tapply(iris$Sepal.Width, iris$Species, mean)))

    # Source: local data frame [3 x 1]
    #
    #            mean_width
    # setosa          3.428
    # versicolor      2.770
    # virginica       2.974

And this still isn't the same, because unique(iris$Species) is an attribute here (the row names) and not a column of the data frame.
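
If the species should appear as a regular column, as in the summarise() output, one option is to build the data frame from the names of the tapply() result. This is a base R sketch with illustrative variable names, not part of the original answer.

```r
means <- tapply(iris$Sepal.Width, iris$Species, mean)

# Wrapping the result directly keeps the species only as row names.
with_rownames <- data.frame(mean_width = means)

# Pulling the names out explicitly makes Species an ordinary column.
with_column <- data.frame(Species = names(means),
                          mean_width = as.vector(means))

print(with_rownames)
print(with_column)
```

The second data frame has the same two-column shape as the dplyr result, although it is a plain data.frame and not a tbl_df.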
