dplyr summarise() and summarise_each() make extra calls to the provided functions
It seems that summarise and summarise_each are making unnecessary extra calls to the callback functions they are provided with. Suppose that we have the following data frame:
X <- data.frame( Group = rep(c("G1","G2"),2:3), Var1 = 1:5, Var2 = 11:15 )
which looks like this:
Group Var1 Var2
1 G1 1 11
2 G1 2 12
3 G2 3 13
4 G2 4 14
5 G2 5 15
Further suppose that we have a (potentially expensive) function
f <- function(v)
{
cat( "Calling f with vector", v, "\n" )
## ...additional bookkeeping and processing...
mean(v)
}
that we would like to apply to each of our variables in each group. Using dplyr, we might go about it in the following way:
X %>% group_by( Group ) %>% summarise_each( funs(f) )
However, the output shows that f was called one additional time for each variable in G1:
Calling f with vector 1 2
Calling f with vector 1 2
Calling f with vector 3 4 5
Calling f with vector 11 12
Calling f with vector 11 12
Calling f with vector 13 14 15
# A tibble: 2 x 3
Group Var1 Var2
<fctr> <dbl> <dbl>
1 G1 1.5 11.5
2 G2 4.0 14.0
The same issue is present when using summarise:
> X %>% group_by( Group ) %>% summarise( test = f(Var1) )
Calling f with vector 1 2
Calling f with vector 1 2
Calling f with vector 3 4 5
# A tibble: 2 × 2
Group test
<fctr> <dbl>
1 G1 1.5
2 G2 4.0
Why is this happening, and how would one go about preventing summarise and summarise_each from making those extra calls?
(This is using R version 3.3.0 and dplyr version 0.5.0.)
EDIT: It appears that the issue has to do with the interplay between group_by and summarise/summarise_each. Without the grouping, no extra calls are made. Also, mutate and mutate_each do not suffer from this issue. (Credit: eddi and eipi10 for these findings.)
Although this issue is still present in dplyr 0.5.0 (published 2016-06-24), it is fixed in the dplyr GitHub repo. It was fixed with this commit made on 2016-09-24. I've confirmed that I can reproduce the issue when I check out and build the version at the previous commit, but not when building from that commit or subsequent ones.
(And yes, I tried a whole bunch of other ones before I found it. Why I go to such lengths in hope of earning imaginary internet points, I leave as a question for my therapist. :)
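For anyone who cannot move to a build that contains the fix, one possible workaround is to do the per-group step in base R, where split() combined with vapply() invokes the function exactly once per column per group. This is only a sketch under that assumption, not something taken from the dplyr documentation; the data frame and f mirror the question, while chunks, per_group, m and result are names introduced here.

```r
X <- data.frame(Group = rep(c("G1", "G2"), 2:3), Var1 = 1:5, Var2 = 11:15)

f <- function(v) {
  cat("Calling f with vector", v, "\n")
  mean(v)
}

# Split the value columns into one small data frame per group.
chunks <- split(X[c("Var1", "Var2")], X$Group)

# Call f once per column within each group; each chunk yields a named numeric vector.
per_group <- lapply(chunks, function(chunk) vapply(chunk, f, numeric(1)))

# Stack the per-group vectors into a matrix, then turn it into a data frame.
m <- do.call(rbind, per_group)
result <- data.frame(Group = rownames(m), m, row.names = NULL)

print(result)
```

The trade-off is that the result is a plain data frame built by hand rather than a tibble, so it may need converting before being fed back into a dplyr pipeline.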
In particular, in the function SEXP process_data(const Data& gdf) in inst/include/dplyr/Result/CallbackProcessor.h, note these changes:
CLASS* obj = static_cast<CLASS*>(this);
typename Data::group_iterator git = gdf.group_begin();
RObject first_result = obj->process_chunk(*git);
++git; // This line was added
and
for (int i = 1; i < ngroups; ++git, ++i) { // changed from starting at i = 0
RObject chunk = obj->process_chunk(*git);
[Comments added by me, not part of the actual source]
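To make the effect of those two changes easier to see, here is a small R analogy of the loop structure. It is my reading of the diff, not a translation of the actual C++ source: process_chunk, process_data_before and process_data_after are hypothetical stand-ins, and the real code works with iterators and R objects rather than a plain list.

```r
# Hypothetical stand-in for process_chunk(): logs the chunk it is given.
calls <- character(0)
process_chunk <- function(chunk) {
  calls <<- c(calls, paste(chunk, collapse = " "))
  mean(chunk)
}

# Pre-fix shape: the first chunk is processed before the loop,
# and the loop then starts over at the first group.
process_data_before <- function(groups) {
  first_result <- process_chunk(groups[[1]])
  out <- numeric(length(groups))
  for (i in seq_along(groups)) {
    out[i] <- process_chunk(groups[[i]])
  }
  out
}

# Post-fix shape: the first result is reused and the loop starts at the second group.
process_data_after <- function(groups) {
  out <- numeric(length(groups))
  out[1] <- process_chunk(groups[[1]])
  for (i in seq_along(groups)[-1]) {
    out[i] <- process_chunk(groups[[i]])
  }
  out
}

groups <- list(G1 = 1:2, G2 = 3:5)

calls <- character(0)
before <- process_data_before(groups)
calls_before <- calls

calls <- character(0)
after <- process_data_after(groups)
calls_after <- calls

print(calls_before)
print(calls_after)
```

In this analogy the pre-fix version logs the first group twice, which matches the pattern in the question's output, while both versions return the same summary values.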