dplyr summarise() and summarise_each() make extra calls to the provided functions

Problem description

It seems that summarise and summarise_each make unnecessary extra calls to the callback functions they are given. Suppose we have the following data frame:

X <- data.frame( Group = rep(c("G1","G2"),2:3), Var1 = 1:5, Var2 = 11:15 )

which looks like this:

   Group Var1 Var2
 1    G1    1   11
 2    G1    2   12
 3    G2    3   13
 4    G2    4   14
 5    G2    5   15

Further suppose that we have a (potentially expensive) function

f <- function(v)
{
   cat( "Calling f with vector", v, "\n" )
   ## ...additional bookkeeping and processing...
   mean(v)
}

that we would like to apply to each of our variables in each group. Using dplyr, we might go about it in the following way:

X %>% group_by( Group ) %>% summarise_each( funs(f) )

However, the output shows that f was called one additional time for each variable in G1:

Calling f with vector 1 2 
Calling f with vector 1 2 
Calling f with vector 3 4 5 
Calling f with vector 11 12 
Calling f with vector 11 12 
Calling f with vector 13 14 15 
# A tibble: 2 x 3
   Group  Var1  Var2
  <fctr> <dbl> <dbl> 
1     G1   1.5  11.5
2     G2   4.0  14.0

The same issue is present when using summarise:

> X %>% group_by( Group ) %>% summarise( test = f(Var1) )
Calling f with vector 1 2
Calling f with vector 1 2
Calling f with vector 3 4 5
# A tibble: 2 × 2
   Group  test
  <fctr> <dbl>
1     G1   1.5
2     G2   4.0

Why is this happening and how would one go about preventing summarise and summarise_each from making those extra calls?

(This is using R version 3.3.0 and dplyr version 0.5.0)

EDIT: It appears that the issue has to do with the interplay between group_by and summarise/summarise_each. Without the grouping, no extra calls are made. Also, mutate and mutate_each do not suffer from this issue. (Credit: eddi and eipi10 for these findings)

Solution

Although this issue is still present in dplyr 0.5.0 (published 2016-06-24), it is fixed in the dplyr GitHub repository. It was fixed with this commit made on 2016-09-24. I've confirmed that I can reproduce the issue when I check out and build the version at the previous commit, but not when building from that commit or any subsequent one.

(And yes, I tried a whole bunch of other ones before I found it. Why I go to such lengths in hope of earning imaginary internet points, I leave as a question for my therapist. :)

In particular, in the function SEXP process_data(const Data& gdf) in inst/include/dplyr/Result/CallbackProcessor.h, note these changes:

  CLASS* obj = static_cast<CLASS*>(this);
  typename Data::group_iterator git = gdf.group_begin();

  RObject first_result = obj->process_chunk(*git);
  ++git; // This line was added

and

  for (int i = 1; i < ngroups; ++git, ++i) { // changed from starting at i = 0
    RObject chunk = obj->process_chunk(*git);

[Comments added by me, not part of the actual source]
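To make the effect of those two changes concrete, here is a minimal, self-contained sketch of the pattern. It is not dplyr's code: `Group`, `process_chunk`, `process_data_before`, and `process_data_after` are hypothetical stand-ins, and the "before" variant is only inferred from the two comments above (no `++git` after the first chunk, loop starting at `i = 0`). How the real code uses `first_result` is also an assumption here. Groups are assumed to be non-empty.

```cpp
#include <cstddef>
#include <vector>

using Group = std::vector<double>;

// Hypothetical stand-in for process_chunk(): returns the mean of one group
// and appends the group to `calls` so the number of evaluations can be counted.
double process_chunk(const Group& g, std::vector<Group>& calls) {
    calls.push_back(g);
    double sum = 0.0;
    for (double x : g) sum += x;
    return sum / static_cast<double>(g.size());
}

// Assumed pre-fix pattern: the first group is evaluated once up front,
// but the iterator is not advanced and the loop starts at i = 0,
// so the first group is evaluated a second time.
std::vector<double> process_data_before(const std::vector<Group>& groups,
                                        std::vector<Group>& calls) {
    std::vector<double> out(groups.size());
    if (groups.empty()) return out;
    auto git = groups.begin();
    double first_result = process_chunk(*git, calls);
    out[0] = first_result;
    for (std::size_t i = 0; i < groups.size(); ++git, ++i) {
        out[i] = process_chunk(*git, calls);  // i == 0 repeats the first group
    }
    return out;
}

// Post-fix pattern: advance the iterator past the first group and start
// the loop at i = 1, so every group is evaluated exactly once.
std::vector<double> process_data_after(const std::vector<Group>& groups,
                                       std::vector<Group>& calls) {
    std::vector<double> out(groups.size());
    if (groups.empty()) return out;
    auto git = groups.begin();
    double first_result = process_chunk(*git, calls);
    out[0] = first_result;
    ++git;  // skip the group that was already processed
    for (std::size_t i = 1; i < groups.size(); ++git, ++i) {
        out[i] = process_chunk(*git, calls);
    }
    return out;
}
```

With the two groups from the question, {1, 2} and {3, 4, 5}, the pre-fix variant records three calls (the first group twice) and the post-fix variant records two, while both return the same means. That matches the symptom reported above: only G1, the first group, sees the extra call.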
