dplyr summarise() and summarise_each() make extra calls to the provided functions

Published: 2017/7/13 21:00:31 · Tags: r, dplyr


Problem description


It seems that summarise and summarise_each are making unnecessary extra calls to the callback functions they are provided with. Suppose that we have the following data frame:

X <- data.frame( Group = rep(c("G1","G2"),2:3), Var1 = 1:5, Var2 = 11:15 )

which looks like this:

   Group Var1 Var2
 1    G1    1   11
 2    G1    2   12
 3    G2    3   13
 4    G2    4   14
 5    G2    5   15

Further suppose that we have a (potentially expensive) function

f <- function(v)
{
   cat( "Calling f with vector", v, "\n" )
   ## ...additional bookkeeping and processing...
   mean(v)
}

that we would like to apply to each of our variables in each group. Using dplyr, we might go about it in the following way:

X %>% group_by( Group ) %>% summarise_each( funs(f) )

However, the output shows that f was called one additional time for each variable in G1:

Calling f with vector 1 2 
Calling f with vector 1 2 
Calling f with vector 3 4 5 
Calling f with vector 11 12 
Calling f with vector 11 12 
Calling f with vector 13 14 15 
# A tibble: 2 x 3
   Group  Var1  Var2
  <fctr> <dbl> <dbl> 
1     G1   1.5  11.5
2     G2   4.0  14.0

The same issue is present when using summarize:

> X %>% group_by( Group ) %>% summarise( test = f(Var1) )
Calling f with vector 1 2
Calling f with vector 1 2
Calling f with vector 3 4 5
# A tibble: 2 × 2
   Group  test
  <fctr> <dbl>
1     G1   1.5
2     G2   4.0

Why is this happening and how would one go about preventing summarise and summarise_each from making those extra calls?

(This is using R version 3.3.0 and dplyr version 0.5.0)

EDIT: It appears that the issue has to do with the interplay between group_by and summarise/summarise_each. Without the grouping, no extra calls are made. Also, mutate and mutate_each do not suffer from this issue. (Credit: eddi and eipi10 for these findings)
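For readers stuck on an affected dplyr version, one possible workaround (not taken from the original post) is to do the grouped summary in base R, where the function is applied exactly once per group. The sketch below reuses the data frame from the question; `make_counted` is a hypothetical helper introduced here only to count calls, and `mean` stands in for the expensive function.

```r
# Same example data as in the question.
X <- data.frame(Group = rep(c("G1", "G2"), 2:3), Var1 = 1:5, Var2 = 11:15)

# Hypothetical helper (not part of dplyr or base R): wraps a function
# so that the number of calls made to it can be inspected afterwards.
make_counted <- function(fun) {
  n_calls <- 0
  wrapped <- function(v) {
    n_calls <<- n_calls + 1
    fun(v)
  }
  list(fun = wrapped, calls = function() n_calls)
}

counted <- make_counted(mean)

# Split the column by group, then apply the function once per piece.
res <- vapply(split(X$Var1, X$Group), counted$fun, numeric(1))

res              # named numeric vector: G1 = 1.5, G2 = 4
counted$calls()  # 2, i.e. one call per group
```

The result is a named vector rather than a tibble, so this only fits cases where avoiding the duplicate call matters more than staying inside a dplyr pipeline.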

Solution

Although this issue is still present in dplyr 0.5.0 (published 2016-06-24), it is fixed in the dplyr GitHub repository. It was fixed with a commit made on 2016-09-24. I've confirmed that I can reproduce the issue when I check out and build the version at the previous commit, but not when building from that commit or any subsequent one.

(And yes, I tried a whole bunch of other ones before I found it. Why I go to such lengths in hope of earning imaginary internet points, I leave as a question for my therapist. :)

In particular, in the function SEXP process_data(const Data& gdf) in inst/include/dplyr/Result/CallbackProcessor.h, note these changes:

  CLASS* obj = static_cast<CLASS*>(this);
  typename Data::group_iterator git = gdf.group_begin();

  RObject first_result = obj->process_chunk(*git);
  ++git; // This line was added

and

  for (int i = 1; i < ngroups; ++git, ++i) { // changed from starting at i = 0
    RObject chunk = obj->process_chunk(*git);

[Comments added by me, not part of the actual source]
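The effect of that change can be illustrated with a simplified model in R. This is only a sketch of the control flow, not the dplyr source: `process_groups` and its `skip_first` argument are hypothetical names, and the up-front evaluation of the first group is assumed here to be what fixes the type of the output.

```r
# Simplified model of the grouped-summary loop; NOT the dplyr source.
# `chunks` is a list holding one vector per group.
process_groups <- function(chunks, fun, skip_first) {
  # The first group is always evaluated up front.
  first_result <- fun(chunks[[1]])
  out <- vector(typeof(first_result), length(chunks))
  out[1] <- first_result

  # Old behaviour: the loop starts at the first group again.
  # Fixed behaviour: the loop starts at the second group.
  start <- if (skip_first) 2 else 1
  if (start <= length(chunks)) {
    for (i in start:length(chunks)) {
      out[i] <- fun(chunks[[i]])
    }
  }
  out
}

# Counting stand-in for the expensive function from the question.
n_calls <- 0
f <- function(v) {
  n_calls <<- n_calls + 1
  mean(v)
}

chunks <- list(G1 = 1:2, G2 = 3:5)

n_calls <- 0
old <- process_groups(chunks, f, skip_first = FALSE)
calls_old <- n_calls  # 3: G1 is evaluated twice

n_calls <- 0
new <- process_groups(chunks, f, skip_first = TRUE)
calls_new <- n_calls  # 2: one call per group
```

Both variants return the same values; only the number of calls differs, which matches the duplicated "Calling f with vector 1 2" line in the question's output.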
