R Replace Intermittent NA Values With Last Observation Carried Forward (NA.LOCF)
Background
I need to replace the NA's in my data frame using different methods depending on the nature of each NA. My data frame comes from a study with repeated measures, where some of the NA's are the result of subjects dropping out, while others are the result of intermittent missing measurements, defined as one missing measurement or a sequence of several, followed by a measured value. I will refer to intermittent missing measurements as intermittent NA's.
Problem
I am having trouble testing whether the NA's are the result of intermittent missing measurements, and working out which functions I should use to replace them. Ideally, I would replace the intermittent NA's using the na.locf method, but I need dropout NA's to be replaced with the baseline OR the last value observed, whichever is greater.
Examples
Example 1
Here is a clean example of NA's that I want to be treated as intermittent NA's with the na.locf imputation:
data.frame(visit=c(1,2,3,4,5,6,7,8,9,10),value=c(34,NA,NA,15,16,19,NA,12,23,31))
And how I want the end result to be:
data.frame(visit=c(1,2,3,4,5,6,7,8,9,10),value=c(34,34,34,15,16,19,19,12,23,31))
Example 2
Here is a clean example of NA's (dropout NA's) that I want to be imputed by the previous non-NA observation OR the baseline value (visit 1), whichever is greatest:
data.frame(visit=c(1,2,3,4,5,6,7,8,9,10),value=c(34,22,18,15,16,19,NA,NA,NA,NA))
And how I want the end result to be:
data.frame(visit=c(1,2,3,4,5,6,7,8,9,10),value=c(34,22,18,15,16,19,34,34,34,34))
Example 3
Here is a complex example of a mixture of NA's which need different imputations, here where the previous non-NA observation is greater than the baseline observation (visit 1) for the dropout NA's:
data.frame(visit=c(1,2,3,4,5,6,7,8,9,10),value=c(34,NA,NA,42,16,19,NA,38,NA,NA))
How I need the result to be:
data.frame(visit=c(1,2,3,4,5,6,7,8,9,10),value=c(34,34,34,42,16,19,19,38,38,38))
Example 4
Another complex example where the baseline observation (visit 1) is greater than the previous non-NA value for the dropout NA's:
data.frame(visit=c(1,2,3,4,5,6,7,8,9,10),value=c(40,NA,NA,42,16,19,NA,38,NA,NA))
How I need the result to be:
data.frame(visit=c(1,2,3,4,5,6,7,8,9,10),value=c(40,40,40,42,16,19,19,38,40,40))
What I have tried
As suggested by @Gregor, after I stated that this would solve my problem, it is possible to test for the presence of intermittent NA's with:
mutate(is.na(value) & !is.na(lead(value)))
But this does not help me impute all intermittent NA's, in particular those in a sequence such as (NA1, NA2, NA3, 14), where only NA3 is returned as TRUE after running this test.
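The limitation described above can be reproduced on a bare vector. This is a minimal sketch, assuming dplyr is installed; the vector and the name `flag` are made up for illustration and stand in for the (NA1, NA2, NA3, 14) sequence.

```r
library(dplyr)

# Hypothetical run of three NAs followed by an observed value
value <- c(NA, NA, NA, 14)

# lead() looks only one position ahead, so an NA is flagged only when
# the very next element is observed
flag <- is.na(value) & !is.na(lead(value))
print(flag)
```

Only the third element comes back TRUE, because the first two NAs are each followed by another NA rather than by a measurement.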
Solution
We can use na.locf(..., fromLast = TRUE) to identify the trailing NA values and use pmax on them with the baseline. We'll demonstrate on the examples from your question in a nice all-together format:
# consolidate example data
dd = data.frame(
  example = rep(1:3, each = 10),
  visit = rep(1:10, 3),
  value = c(34,NA,NA,15,16,19,NA,12,23,31,
            34,22,18,15,16,19,NA,NA,NA,NA,
            34,NA,NA,42,16,19,NA,38,NA,NA),
  goal = c(34,34,34,15,16,19,19,12,23,31,
           34,22,18,15,16,19,34,34,34,34,
           34,34,34,42,16,19,19,38,38,38)
)
library(dplyr)
dd = dd %>%
  group_by(example) %>%
  # to_fill is FALSE only for trailing (dropout) NAs, which have no later observation
  mutate(to_fill = !is.na(zoo::na.locf(value, fromLast = TRUE, na.rm = FALSE)),
         result = if_else(to_fill,
                          zoo::na.locf(value, na.rm = FALSE),
                          pmax(first(value), zoo::na.locf(value, na.rm = FALSE))))
all(dd$goal == dd$result)
# [1] TRUE
As you can see, the result matches the goal column perfectly.