R Replace Intermittent NA Values With Last Observation Carried Forward (NA.LOCF)


Question

Background

I need to replace the NA's in my data frame using different methods depending on the nature of each NA. My data frame comes from a study with repeated measures, where some of the NA's are a result of subjects dropping out while others are a result of intermittent missing measurements, defined as one missing measurement or a sequence of several, followed by a measured value. I will refer to intermittent missing measurements as intermittent NA's.

Problem

I am having trouble testing whether an NA is the result of an intermittent missing measurement, and deciding which functions I should use to replace each kind of NA. Ideally I would replace the intermittent NA's using the na.locf method. But I need dropout NA's to be replaced with the baseline OR the last value observed, whichever is greater.

Examples

Example 1

Here is a clean example of NA's that I want to be treated as intermittent NA's and imputed with na.locf:

data.frame(visit=c(1,2,3,4,5,6,7,8,9,10),value=c(34,NA,NA,15,16,19,NA,12,23,31))

And how I want the end result to be:

data.frame(visit=c(1,2,3,4,5,6,7,8,9,10),value=c(34,34,34,15,16,19,19,12,23,31))

Example 2

Here is a clean example of NA's (dropout NA's) that I want to be imputed by the previous non-NA observation OR the baseline value (visit 1), whichever is greater:

data.frame(visit=c(1,2,3,4,5,6,7,8,9,10),value=c(34,22,18,15,16,19,NA,NA,NA,NA))

And how I want the end result to be:

data.frame(visit=c(1,2,3,4,5,6,7,8,9,10),value=c(34,22,18,15,16,19,34,34,34,34))

Example 3

Here is a complex example with a mixture of NA's that need different imputations. Here, for the dropout NA's, the previous non-NA observation is greater than the baseline observation (visit 1):

data.frame(visit=c(1,2,3,4,5,6,7,8,9,10),value=c(34,NA,NA,42,16,19,NA,38,NA,NA))

How I need the result to be:

data.frame(visit=c(1,2,3,4,5,6,7,8,9,10),value=c(34,34,34,42,16,19,19,38,38,38))

Example 4

Another complex example where the baseline observation (visit 1) is greater than the previous non-NA value for the dropout NA's:

data.frame(visit=c(1,2,3,4,5,6,7,8,9,10),value=c(40,NA,NA,42,16,19,NA,38,NA,NA))

How I need the result to be:

data.frame(visit=c(1,2,3,4,5,6,7,8,9,10),value=c(40,40,40,42,16,19,19,38,40,40))


What I have tried

As suggested by @Gregor, after I said that this would solve my problem, it is possible to test for the presence of intermittent NA's with:

mutate(is.na(value) & !is.na(lead(value)))

But this does not help me impute all intermittent NA's, in particular intermittent NA's that occur in a sequence (NA1,NA2,NA3,14), where only NA3 is returned as TRUE after running this test.
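
A minimal sketch of that limitation, using a made-up vector (the name x and its values are for illustration only, not from the study data). It applies the same test to a run of three NA's followed by one measurement, and only the NA directly before the observed value is flagged:

```r
library(dplyr)

# Illustrative vector only: three consecutive NA's followed by a measurement
x <- c(NA, NA, NA, 14)

# Same test as above: an element is NA and the next element is not NA
flag <- is.na(x) & !is.na(lead(x))
print(flag)
# [1] FALSE FALSE  TRUE FALSE
```

Because lead() looks only one position ahead, the first two NA's in the run see another NA as their neighbour and are not flagged.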

Solution

We can use na.locf(..., fromLast = TRUE) to identify the trailing NA values and use pmax on them with the baseline. We'll demonstrate on the examples from your question in a nice all-together format:

# consolidate example data
dd = data.frame(
  example = rep(1:3, each = 10),
  visit = rep(1:10, 3),
  value = c(34,NA,NA,15,16,19,NA,12,23,31,
            34,22,18,15,16,19,NA,NA,NA,NA,
            34,NA,NA,42,16,19,NA,38,NA,NA),
  goal = c(34,34,34,15,16,19,19,12,23,31,
           34,22,18,15,16,19,34,34,34,34,
           34,34,34,42,16,19,19,38,38,38)
)

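library(dplyr)
dd = dd %>% group_by(example) %>%
  mutate(
    # TRUE where a later non-NA value exists (observed values and intermittent NA's);
    # FALSE only for trailing (dropout) NA's
    to_fill = !is.na(zoo::na.locf(value, fromLast = TRUE, na.rm = FALSE)),
    result = if_else(to_fill,
                     # plain LOCF for everything before the dropout
                     zoo::na.locf(value, na.rm = FALSE),
                     # dropout NA's: the larger of the baseline and the last observed value
                     pmax(first(value), zoo::na.locf(value, na.rm = FALSE)))
  )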

all(dd$goal == dd$result)
# [1] TRUE

As you can see, the result matches the goal column perfectly.
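
The data above covers Examples 1 to 3 only. As a sketch of the same idea applied to Example 4 from the question, the logic can be wrapped in a small helper. The name impute_visits is hypothetical (it is not part of zoo or dplyr), and the sketch assumes that the first element of the vector is the observed baseline and that the zoo package is installed:

```r
# Hypothetical helper (not part of zoo or dplyr); assumes x[1] is the
# observed baseline and that the zoo package is installed.
impute_visits <- function(x) {
  # Last observation carried forward
  locf <- zoo::na.locf(x, na.rm = FALSE)
  # TRUE wherever some later (or current) observation exists,
  # i.e. everywhere except the trailing dropout NA's
  has_later_obs <- !is.na(zoo::na.locf(x, fromLast = TRUE, na.rm = FALSE))
  # Before dropout: plain LOCF. Trailing NA's: larger of baseline and LOCF.
  ifelse(has_later_obs, locf, pmax(x[1], locf))
}

# Example 4 from the question: baseline (40) exceeds the last observed value (38)
ex4 <- c(40, NA, NA, 42, 16, 19, NA, 38, NA, NA)
res4 <- impute_visits(ex4)
print(res4)
# [1] 40 40 40 42 16 19 19 38 40 40
```

This matches the desired result given for Example 4. Note that if the baseline itself were missing, pmax would return NA for the dropout positions, so that case would need separate handling.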
