Custom sum function in dplyr returns inconsistent results

Question

I've written a custom sum function that ignores NAs unless all values are NA. When I use it in dplyr it returns odd results, and I don't know why.

require(dplyr)

dta <- data.frame(year=2007:2013, rrconf=c(79, NaN ,474,2792,1686,3313,3456), enrolled=c(NaN,NaN,458,1222,1155,1906,2184))

sum0 <- function(x, ...){
  # remove NAs unless all are NA
  if(is.na(mean(x, na.rm=TRUE))) return(NA)
  else(sum(x, ..., na.rm=TRUE))
} 

dta %>%
  group_by(year) %>%
  summarize(rrconf=sum0(rrconf), enrolled=sum0(enrolled))

gives me

Source: local data frame [7 x 3]

  year rrconf enrolled
1 2007     79       NA
2 2008     NA       NA
3 2009    474     TRUE
4 2010   2792     TRUE
5 2011   1686     TRUE
6 2012   3313     TRUE
7 2013   3456     TRUE

In this case it is only summing over one value per group, but in my bigger application it might sum over multiple values. Wrapping my sum0 function in as.integer() seems to fix it, but I couldn't tell you why.

Is this the correct way to work around this problem? Is there something obvious I'm missing?

> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: i386-w64-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.2

loaded via a namespace (and not attached):
[1] assertthat_0.1 magrittr_1.0.1 parallel_3.1.0 Rcpp_0.11.2    tools_3.1.0 


Recommended answer

The issue seems to be that dplyr determines the column type from the first result returned. A bare NA is a logical value by default, which would explain why enrolled, whose first group returns NA, comes back as TRUE for the later groups while rrconf, whose first group returns 79, does not. If you force the NA value to be NA_real_ or NA_integer_, the problem goes away:

##Just to show what NA normally does first:
class(NA)
#[1] "logical"

sum0 <- function(x, ...){
  # remove NAs unless all are NA
  if(is.na(mean(x, na.rm=TRUE))) return(NA_real_)
  else(sum(x, ..., na.rm=TRUE))
} 

dta %>%
  group_by(year) %>%
  summarize(rrconf=sum0(rrconf), enrolled=sum0(enrolled))

#Source: local data frame [7 x 3]
# 
#  year rrconf enrolled
#1 2007     79       NA
#2 2008     NA       NA
#3 2009    474      458
#4 2010   2792     1222
#5 2011   1686     1155
#6 2012   3313     1906
#7 2013   3456     2184
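
The type difference can also be checked in base R, without dplyr. The sketch below is only an illustration: the names sum0_logical and sum0_real and the two test vectors are made up for this example, and the function bodies are the question's and the answer's versions of sum0. It shows that the original function returns a different type depending on its input, and that wrapping the result in as.integer() presumably helps for the same reason NA_real_ does, because it turns the logical NA into a typed one.

```r
# Question's version: the all-missing branch returns a bare NA (logical).
sum0_logical <- function(x, ...) {
  if (is.na(mean(x, na.rm = TRUE))) return(NA)
  sum(x, ..., na.rm = TRUE)
}

# Answer's version: the all-missing branch returns NA_real_ (double).
sum0_real <- function(x, ...) {
  if (is.na(mean(x, na.rm = TRUE))) return(NA_real_)
  sum(x, ..., na.rm = TRUE)
}

# Hypothetical inputs, chosen only to exercise both branches.
all_missing  <- c(NaN, NA)
some_missing <- c(NaN, 458, 42)

class(sum0_logical(all_missing))   # "logical"
class(sum0_logical(some_missing))  # "numeric"

class(sum0_real(all_missing))      # "numeric"
class(sum0_real(some_missing))     # "numeric"

# as.integer() converts the logical NA into NA_integer_.
class(as.integer(sum0_logical(all_missing)))  # "integer"
```

Note that sum() returns an integer when given an integer vector, so if the input columns are integers, NA_integer_ is the matching choice for the all-missing branch.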
