Ignoring NA when summing multiple columns with dplyr


Question

I am summing across multiple columns, some of which contain NA. I am using dplyr::mutate and writing out the arithmetic sum of the columns to get the total. But the columns contain NA, and I would like to treat those values as zero. I was able to get this to work with rowSums (see below), but I am now using mutate. Using mutate makes the code more readable, and it also lets me subtract columns. The example is below.

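library(dplyr)
data(iris)
iris <- as_tibble(iris)  # tbl_df() is deprecated in current dplyr
iris[2, 3] <- NA  # row 2, column 3 is Petal.Length
iris <- mutate(iris, sum = Sepal.Length + Petal.Length)  # sum is NA in row 2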

How do I ensure that the NA in Petal.Length is treated as zero in the expression above? I know that with rowSums I can do something like:

iris$sum <- rowSums(DF[,c("Sepal.Length","Petal.Length")], na.rm = T)

but with mutate it is also easy to write something like diff = Sepal.Length - Petal.Length. What would be a suggested way to accomplish this using mutate?

Note that this post is similar to the following Stack Overflow posts:

Sum across multiple columns with dplyr

Subtract multiple columns ignoring NA

Recommended answer

The problem with your rowSums call is the reference to DF, which is undefined. This works:

mutate(iris, sum2 = rowSums(cbind(Sepal.Length, Petal.Length), na.rm = T))

For a difference, you can negate the column being subtracted: rowSums(cbind(Sepal.Length, -Petal.Length), na.rm = T)
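As a minimal sketch of the rowSums approach, the following computes both a sum and a difference inside mutate on a small made-up data frame (df, a, and b are illustrative names, not from the question). One caveat: with na.rm = TRUE, a row in which every value is NA yields 0 rather than NA.

```r
library(dplyr)

# Hypothetical toy data; `a` and `b` stand in for Sepal.Length and Petal.Length
df <- data.frame(a = c(1, 2, NA, NA), b = c(10, NA, 30, NA))

res <- mutate(
  df,
  sum2  = rowSums(cbind(a,  b), na.rm = TRUE),  # NA is skipped, so it acts like 0
  diff2 = rowSums(cbind(a, -b), na.rm = TRUE)   # negate b to get a - b
)
print(res)
```

The last row shows the all-NA case: both sum2 and diff2 come out as 0 there.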

The general solution is to use ifelse or something similar to set the missing values to 0 (or whatever other value is appropriate):

mutate(iris, sum2 = Sepal.Length + ifelse(is.na(Petal.Length), 0, Petal.Length))
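When several columns need the same treatment, the ifelse replacement can be wrapped in a small helper so that sums and differences stay readable. The sketch below is base R only; na_to_zero and the toy data frame are hypothetical names introduced for illustration, and the helper could be called the same way inside mutate.

```r
# Hypothetical helper: replace NA with 0, leave other values untouched
na_to_zero <- function(x) ifelse(is.na(x), 0, x)

# Toy data standing in for the iris columns
df <- data.frame(a = c(1, 2, NA), b = c(10, NA, 30))

df$sum2  <- na_to_zero(df$a) + na_to_zero(df$b)
df$diff2 <- na_to_zero(df$a) - na_to_zero(df$b)
print(df)
```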

More efficient than ifelse would be an implementation of coalesce; see the examples linked from the original answer. The following uses @krlmlr's answer from that link (see the bottom for the code, or use the kimisc package).

mutate(iris, sum2 = Sepal.Length + coalesce.na(Petal.Length, 0))
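dplyr also exports its own coalesce(), so a hand-written helper may not be needed. The sketch below assumes an installed dplyr version that provides coalesce() and uses a made-up data frame with double columns (df, a, and b are illustrative names).

```r
library(dplyr)

# Hypothetical toy data; both columns are doubles, matching the fallback 0
df <- data.frame(a = c(1, 2, NA), b = c(10, NA, 30))

res <- mutate(
  df,
  sum2  = coalesce(a, 0) + coalesce(b, 0),  # first non-missing of (value, 0)
  diff2 = coalesce(a, 0) - coalesce(b, 0)
)
print(res)
```

Unlike rowSums, this form extends directly to any arithmetic expression, since each column is cleaned before it is used.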

To replace missing values across the whole data set, there is replace_na in the tidyr package.
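A short sketch of that approach, assuming tidyr is installed: replace_na takes a data frame and a named list mapping each column to its replacement value. The data frame and column names are made up for illustration. Note that this changes the stored data itself, so the NA values are gone for every later step, not only for the sum.

```r
library(tidyr)

# Hypothetical toy data
df <- data.frame(a = c(1, 2, NA), b = c(10, NA, 30))

# Named list: column name -> value that replaces NA in that column
filled <- replace_na(df, list(a = 0, b = 0))
filled$sum2 <- filled$a + filled$b
print(filled)
```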

@krlmlr's coalesce.na, as found in the linked answer:

coalesce.na <- function(x, ...) {
  x.len <- length(x)
  ly <- list(...)
  for (y in ly) {
    y.len <- length(y)
    if (y.len == 1) {
      x[is.na(x)] <- y
    } else {
      if (x.len %% y.len != 0)
        warning('object length is not a multiple of first object length')
      pos <- which(is.na(x))
      x[pos] <- y[(pos - 1) %% y.len + 1]
    }
  }
  x
}
