Replacing NAs in R with nearest value

Question

I'm looking for something similar to na.locf() in the zoo package, but instead of always using the previous non-NA value I'd like to use the nearest non-NA value. Some example data:

dat <- c(1, 3, NA, NA, 5, 7)

Replacing NA with na.locf (3 is carried forward):

library(zoo)
na.locf(dat)
# 1 3 3 3 5 7

and na.locf with fromLast set to TRUE (5 is carried backwards):

na.locf(dat, fromLast = TRUE)
# 1 3 5 5 5 7

But I wish the nearest non-NA value to be used. In my example this means that the 3 should be carried forward to the first NA, and the 5 should be carried backwards to the second NA:

1 3 3 5 5 7

I have a solution coded up, but wanted to make sure that I wasn't reinventing the wheel. Is there something already floating around?

FYI, my current code is as follows. Perhaps if nothing else, someone can suggest how to make it more efficient. I feel like I'm missing an obvious way to improve this:

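f0 <- function(dat) {
  na.pos <- which(is.na(dat))
  # nothing to fill from if every entry is NA
  if (length(na.pos) == length(dat)) {
    return(dat)
  }
  non.na.pos <- setdiff(seq_along(dat), na.pos)
  # for each NA, index (into non.na.pos) of the closest non-NA position
  nearest.non.na.pos <- vapply(na.pos, function(x) {
    which.min(abs(non.na.pos - x))
  }, integer(1))
  dat[na.pos] <- dat[non.na.pos[nearest.non.na.pos]]
  dat
}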

To answer smci's questions below:

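  1. No, any entry can be NA.
  2. If all entries are NA, leave them as they are.
  3. No. My current solution defaults to the nearest value on the left, but it doesn't matter.
  4. These rows are typically a few hundred thousand elements long, so in theory the upper bound is a few hundred thousand. In practice there are no more than a few NAs here and there, typically a single one.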
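
Point 3 follows from how which.min behaves: it returns the index of the first minimum, so an NA that is equidistant from two non-NA values takes the one on its left. A small illustration on a made-up three-element vector (not data from the question):

```r
dat <- c(3, NA, 5)                         # the NA at position 2 is equidistant from 3 and 5
non.na.pos <- which(!is.na(dat))           # positions 1 and 3
nearest <- which.min(abs(non.na.pos - 2))  # both distances are 1; the first one wins
nearest                                    # 1
dat[non.na.pos[nearest]]                   # 3, the left-hand neighbour
```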

Update: It turns out that we're going in a different direction altogether, but this was still an interesting discussion. Thanks all!

Recommended answer

Here is a very fast one. It uses findInterval to find what two positions should be considered for each NA in your original data:

f1 <- function(dat) {
  N <- length(dat)
  na.pos <- which(is.na(dat))
  if (length(na.pos) %in% c(0, N)) {
    return(dat)
  }
  non.na.pos <- which(!is.na(dat))
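  # index (into non.na.pos) of the non-NA position just left of each NA
  intervals  <- findInterval(na.pos, non.na.pos,
                             all.inside = TRUE)
  left.pos   <- non.na.pos[pmax(1, intervals)]
  # clamp to the length of non.na.pos so a lone non-NA value is still usable
  right.pos  <- non.na.pos[pmin(length(non.na.pos), intervals+1)]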
  left.dist  <- na.pos - left.pos
  right.dist <- right.pos - na.pos

  dat[na.pos] <- ifelse(left.dist <= right.dist,
                        dat[left.pos], dat[right.pos])
  return(dat)
}
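
To see what findInterval contributes, the intermediate values can be traced on the example vector from the question. This is only an illustrative walk-through in base R that mirrors the steps of the function above; the variable `filled` is introduced here for the illustration:

```r
dat <- c(1, 3, NA, NA, 5, 7)

na.pos     <- which(is.na(dat))     # 3 4
non.na.pos <- which(!is.na(dat))    # 1 2 5 6

# For each NA position, how many non-NA positions lie at or before it,
# i.e. the index of its left-hand neighbour within non.na.pos
intervals <- findInterval(na.pos, non.na.pos, all.inside = TRUE)
intervals                           # 2 2

left.pos  <- non.na.pos[intervals]      # 2 2 (where the value 3 sits)
right.pos <- non.na.pos[intervals + 1]  # 5 5 (where the value 5 sits)

left.dist  <- na.pos - left.pos     # 1 2
right.dist <- right.pos - na.pos    # 2 1

# Take the closer neighbour; ties go to the left
filled <- dat
filled[na.pos] <- ifelse(left.dist <= right.dist,
                         dat[left.pos], dat[right.pos])
filled                              # 1 3 3 5 5 7
```

Because findInterval searches the sorted vector of non-NA positions in one vectorised call, no per-NA scan over all non-NA positions is needed, which is where the speed-up over the sapply version comes from.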

And here I test it:

# sample data, suggested by @JeffAllen
dat <- as.integer(runif(50000, min=0, max=10))
dat[dat==0] <- NA

# computation times
system.time(r0 <- f0(dat))    # your function
# user  system elapsed 
# 5.52    0.00    5.52
system.time(r1 <- f1(dat))    # this function
# user  system elapsed 
# 0.01    0.00    0.03
identical(r0, r1)
# [1] TRUE
