Replacing NAs in R with nearest value
Question
I'm looking for something similar to na.locf() in the zoo package, but instead of always using the previous non-NA value I'd like to use the nearest non-NA value. Some example data:
dat <- c(1, 3, NA, NA, 5, 7)
Replacing NA with na.locf (3 is carried forward):
library(zoo)
na.locf(dat)
# 1 3 3 3 5 7
and na.locf with fromLast set to TRUE (5 is carried backwards):
na.locf(dat, fromLast = TRUE)
# 1 3 5 5 5 7
But I want the nearest non-NA value to be used. In my example this means that the 3 should be carried forward to the first NA, and the 5 should be carried backwards to the second NA:
1 3 3 5 5 7
I have a solution coded up, but wanted to make sure that I wasn't reinventing the wheel. Is there something already floating around?
FYI, my current code is as follows. Perhaps if nothing else, someone can suggest how to make it more efficient. I feel like I'm missing an obvious way to improve this:
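f0 <- function(dat) {
  na.pos <- which(is.na(dat))
  # nothing to fill, or no non-NA value to fill from
  if (length(na.pos) %in% c(0, length(dat))) {
    return(dat)
  }
  non.na.pos <- setdiff(seq_along(dat), na.pos)
  # for each NA, the index (into non.na.pos) of the closest non-NA position;
  # which.min returns the first minimum, so ties go to the left neighbour
  nearest.non.na.pos <- sapply(na.pos, function(x) {
    which.min(abs(non.na.pos - x))
  })
  dat[na.pos] <- dat[non.na.pos[nearest.non.na.pos]]
  dat
}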
To answer smci's questions below:
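- No, any entry can be NA.
- If all entries are NA, leave them as they are.
- No. My current solution defaults to the nearest value on the left, but it doesn't matter.
- These rows are typically a few hundred thousand elements long, so in theory the upper bound is a few hundred thousand. In practice it's no more than a few here and there, typically a single one.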
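The edge cases in these answers (leading or trailing NAs, ties, all-NA input) can be written down as a small check. This is only a sketch: nearest_fill is a hypothetical name, not something from the thread or from zoo, and it uses the same brute-force which.min idea as the code above.

```r
# nearest_fill() is a hypothetical helper name used for illustration only.
nearest_fill <- function(x) {
  na.pos <- which(is.na(x))
  ok.pos <- which(!is.na(x))
  # No NA (nothing to fill) or all NA (nothing to copy from): return unchanged.
  if (length(na.pos) == 0 || length(ok.pos) == 0) {
    return(x)
  }
  # which.min() returns the first minimum, so an NA that is equally far
  # from both neighbours takes the value on its left.
  nearest <- vapply(na.pos,
                    function(p) ok.pos[which.min(abs(ok.pos - p))],
                    integer(1))
  x[na.pos] <- x[nearest]
  x
}

nearest_fill(c(NA, NA, 4, NA, 8, NA))  # leading, tied and trailing NAs
# 4 4 4 4 8 8
nearest_fill(c(NA_real_, NA_real_))    # all NA: left as is
# NA NA
```

The tie at position 4 (one step from both 4 and 8) resolves to the left value, which matches the "defaults to the left" behaviour described above.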
Update: It turns out that we're going in a different direction altogether, but this was still an interesting discussion. Thanks all!
Recommended answer
Here is a very fast one. It uses findInterval to find which two positions should be considered for each NA in your original data:
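f1 <- function(dat) {
  N <- length(dat)
  na.pos <- which(is.na(dat))
  # nothing to fill, or no non-NA value to fill from
  if (length(na.pos) %in% c(0, N)) {
    return(dat)
  }
  non.na.pos <- which(!is.na(dat))
  # index of the non-NA position to the left of each NA,
  # clamped by all.inside so that a right neighbour always exists
  intervals <- findInterval(na.pos, non.na.pos,
                            all.inside = TRUE)
  left.pos <- non.na.pos[pmax(1, intervals)]
  # cap at the number of non-NA positions so the index stays in range
  right.pos <- non.na.pos[pmin(length(non.na.pos), intervals + 1)]
  left.dist <- na.pos - left.pos
  right.dist <- right.pos - na.pos
  # ties go to the left neighbour
  dat[na.pos] <- ifelse(left.dist <= right.dist,
                        dat[left.pos], dat[right.pos])
  return(dat)
}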
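To see what findInterval contributes, it may help to trace the intermediate vectors on the six-element example from the question. This is just a step-by-step sketch of the same idea at top level; the variable filled is introduced here for illustration.

```r
dat <- c(1, 3, NA, NA, 5, 7)

na.pos     <- which(is.na(dat))    # 3 4
non.na.pos <- which(!is.na(dat))   # 1 2 5 6

# For each NA position, the index into non.na.pos of the non-NA position
# to its left (all.inside = TRUE keeps the index within 1..length - 1).
intervals <- findInterval(na.pos, non.na.pos, all.inside = TRUE)  # 2 2

left.pos  <- non.na.pos[intervals]      # 2 2
right.pos <- non.na.pos[intervals + 1]  # 5 5

left.dist  <- na.pos - left.pos    # 1 2
right.dist <- right.pos - na.pos   # 2 1

# Take the left value unless the right neighbour is strictly closer.
filled <- dat
filled[na.pos] <- ifelse(left.dist <= right.dist, dat[left.pos], dat[right.pos])
filled
# 1 3 3 5 5 7
```

Because every step is a vectorised operation over all NA positions at once, there is no per-NA search through the non-NA positions, which is where the brute-force version spends its time.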
And here I test it:
# sample data, suggested by @JeffAllen
dat <- as.integer(runif(50000, min=0, max=10))
dat[dat==0] <- NA
# computation times
system.time(r0 <- f0(dat)) # your function
# user system elapsed
# 5.52 0.00 5.52
system.time(r1 <- f1(dat)) # this function
# user system elapsed
# 0.01 0.00 0.03
identical(r0, r1)
# [1] TRUE