Data Cleaning for Survival Analysis Using a Participant's Own Data to Impute Values


Question

I'm cleaning some data for a survival analysis, and I want missing values to be imputed from the surrounding values within the same subject. For each missing value, I'd like to use the mean of the participant's closest previous value and closest subsequent value. If no subsequent value is present, I'd like to carry the previous value forward until a subsequent value appears.

I've been trying to break the problem into smaller, more manageable operations and objects. However, the solutions I keep arriving at force me to use conditional logic based on the rows immediately above and below a missing value and, quite frankly, I'm at a bit of a loss as to how to do this. I would love some guidance if you know of a good technique I can use or experiment with, or of any good search terms I can use when looking for a solution.

Details are as follows:

#Fake dataset creation
id <- c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4)
time <- c(0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6,0,1,2,3,4,5,6)
ss <- c(2,2,4,3,NA,0,0,1,4,0,NA,0,0,0,4,2,1,3,3,2,NA,3,4,3,NA,NA,0,0)
mydat <- data.frame(id, time, ss)

In the target sequences below, the only values that differ from the dataset above are the ones that replace an NA.

The goal is to find a way to fill in the NA values of the variable ss so that each ID looks like this:

ID #1: 2, 2, 4, 3, 1.5, 0, 0

ID #2: 1, 4, 0, 0, 0, 0, 0

ID #3: 4, 2, 1, 3, 3, 2, NA (no change, because the row with the NA will eventually be deleted)

ID #4: 3, 4, 3, 3, 1.5, 0, 0 (this one requires multiple changes, and I expect it to be the most challenging to tackle)

Recommended answer

If processing speed is not an issue (I suspect the ID #4 case, with two consecutive NAs, makes it hard to vectorize the imputation), you could try a simple loop:

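# Impute the NAs in one subject's vector, working left to right so that an
# already-imputed value can serve as the previous neighbour of the next NA
f <- function(x) {
  idx <- which(is.na(x))
  for (id in idx) {
    # values immediately before and after position id
    sel <- x[id + c(-1, 1)]
    # not the last position: drop a missing neighbour, so a missing next
    # value reduces to carrying the previous value forward;
    # last position: keep the out-of-range NA so the result stays NA
    if (id < length(x))
      sel <- sel[!is.na(sel)]
    x[id] <- mean(sel)
  }
  return(x)
}
# apply f within each id and append the result as column ss_imp
cbind(mydat, ss_imp = ave(mydat$ss, mydat$id, FUN = f))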
#    id time ss ss_imp
# 1   1    0  2    2.0
# 2   1    1  2    2.0
# 3   1    2  4    4.0
# 4   1    3  3    3.0
# 5   1    4 NA    1.5
# 6   1    5  0    0.0
# 7   1    6  0    0.0
# 8   2    0  1    1.0
# 9   2    1  4    4.0
# 10  2    2  0    0.0
# 11  2    3 NA    0.0
# 12  2    4  0    0.0
# 13  2    5  0    0.0
# 14  2    6  0    0.0
# 15  3    0  4    4.0
# 16  3    1  2    2.0
# 17  3    2  1    1.0
# 18  3    3  3    3.0
# 19  3    4  3    3.0
# 20  3    5  2    2.0
# 21  3    6 NA     NA
# 22  4    0  3    3.0
# 23  4    1  4    4.0
# 24  4    2  3    3.0
# 25  4    3 NA    3.0
# 26  4    4 NA    1.5
# 27  4    5  0    0.0
# 28  4    6  0    0.0
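
As a follow-up, here is a minimal sketch of the same rule with the end cases written out explicitly, followed by the row deletion the question mentions. The names impute_neighbors and clean are hypothetical and not part of the original answer. It also assumes something the question does not specify: a leading NA with no previous value is left as NA, whereas f above would fill it from the next value. Rows are assumed to be ordered by time within each id, so the sketch sorts them first.

```r
# Rebuild the example dataset from the question
id   <- rep(1:4, each = 7)
time <- rep(0:6, times = 4)
ss   <- c(2,2,4,3,NA,0,0, 1,4,0,NA,0,0,0, 4,2,1,3,3,2,NA, 3,4,3,NA,NA,0,0)
mydat <- data.frame(id, time, ss)

# The rule relies on row order, so sort by id and time first
mydat <- mydat[order(mydat$id, mydat$time), ]

# Hypothetical helper: same rule as f(), guarded at both ends of the vector
impute_neighbors <- function(x) {
  n <- length(x)
  for (i in which(is.na(x))) {
    prev_val <- if (i > 1) x[i - 1] else NA  # may itself be an imputed value
    next_val <- if (i < n) x[i + 1] else NA
    # trailing NA, or nothing before it (assumption): leave it as NA
    if (i == n || is.na(prev_val)) next
    # no next value yet: carry the previous one forward; otherwise average
    x[i] <- if (is.na(next_val)) prev_val else mean(c(prev_val, next_val))
  }
  x
}

# Impute within each subject, then drop the rows that are still missing
mydat$ss_imp <- ave(mydat$ss, mydat$id, FUN = impute_neighbors)
clean <- mydat[!is.na(mydat$ss_imp), ]
```

On the example data this should reproduce the ss_imp column shown above, and clean should keep every row except the last one for ID #3. If leading NAs ought to be back-filled instead, the guard on prev_val is the place to change.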
