Custom sum function in dplyr returns inconsistent results
Question
I've made a custom sum function that ignores NAs unless all values are NA. When I use it in dplyr it returns odd results and I don't know why.
require(dplyr)
dta <- data.frame(year=2007:2013, rrconf=c(79, NaN ,474,2792,1686,3313,3456), enrolled=c(NaN,NaN,458,1222,1155,1906,2184))
sum0 <- function(x, ...){
  # remove NAs unless all are NA
  if(is.na(mean(x, na.rm=TRUE))) return(NA)
  else(sum(x, ..., na.rm=TRUE))
}
dta %>%
  group_by(year) %>%
  summarize(rrconf=sum0(rrconf), enrolled=sum0(enrolled))
This gives me:
Source: local data frame [7 x 3]

  year rrconf enrolled
1 2007     79       NA
2 2008     NA       NA
3 2009    474     TRUE
4 2010   2792     TRUE
5 2011   1686     TRUE
6 2012   3313     TRUE
7 2013   3456     TRUE
In this case it is only summing over one value per group, but in my bigger application it might sum over multiple values. Wrapping my sum0 function in as.integer() seems to fix it, but I couldn't tell you why.
Is this the correct way to work around this problem? Is there something obvious I'm missing?
> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: i386-w64-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_0.2
loaded via a namespace (and not attached):
[1] assertthat_0.1 magrittr_1.0.1 parallel_3.1.0 Rcpp_0.11.2 tools_3.1.0
Recommended answer
The issue seems to be that dplyr determines the column type from the first returned result. For enrolled, the first group (2007) returns NA, which is a logical value by default, so the whole column is treated as logical and the later sums are coerced to TRUE. For rrconf, the first result is 79, so that column stays numeric. If you force the NA value to be NA_real_ or NA_integer_, the problem goes away:
## Just to show what NA normally does first:
class(NA)
#[1] "logical"
sum0 <- function(x, ...){
  # remove NAs unless all are NA
  if(is.na(mean(x, na.rm=TRUE))) return(NA_real_)
  else(sum(x, ..., na.rm=TRUE))
}
dta %>%
  group_by(year) %>%
  summarize(rrconf=sum0(rrconf), enrolled=sum0(enrolled))
#Source: local data frame [7 x 3]
#
#  year rrconf enrolled
#1 2007     79       NA
#2 2008     NA       NA
#3 2009    474      458
#4 2010   2792     1222
#5 2011   1686     1155
#6 2012   3313     1906
#7 2013   3456     2184
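The type behaviour behind this can be checked in base R without dplyr. The sketch below is only illustrative: sum0_typed is a hypothetical name, not part of the original post, and it assumes the all-NA test can be written with all(is.na(x)), which gives the same result as the mean() check for numeric input. It also suggests why wrapping the original function in as.integer() appeared to work: as.integer() converts a logical NA into an integer NA, so every group returns the same type.

```r
# Plain NA is logical; typed NA constants exist for other vector types.
class(NA)              # "logical"
class(NA_real_)        # "numeric"
class(NA_integer_)     # "integer"

# A non-zero number coerced to logical becomes TRUE, matching the TRUE
# values in the enrolled column of the question's output.
as.logical(458)        # TRUE

# as.integer() turns a logical NA into an integer NA.
class(as.integer(NA))  # "integer"

# Hypothetical variant of sum0 that returns a double in every branch.
sum0_typed <- function(x, ...) {
  # all values missing (NA or NaN): return a numeric NA, not a logical one
  if (all(is.na(x))) return(NA_real_)
  # as.numeric() keeps the result a double even when x is an integer vector
  as.numeric(sum(x, ..., na.rm = TRUE))
}

sum0_typed(c(NaN, NaN))  # NA (of type double)
sum0_typed(c(1, NA, 2))  # 3
```

Returning the same type from every branch means the result does not depend on which group happens to be evaluated first.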