Filling in missing (blanks) in a data table, per category - backwards and forwards

Problem description

I am working with a large data set of billing records for my clinical practice over 11 years. Quite a few of the rows are missing the referring physician. However, using some rules I can quite easily fill them in but do not know how to implement it in data.table under R. I know that there are things such as na.locf in the zoo package and self rolling join in the data.table package. The examples that I have seen are too simplistic and do not help me.

Here is some fictitious data to orient you (as a dput ASCII text representation)

    structure(list(patient.first.name = structure(c(1L, 1L, 1L, 1L, 
1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L), .Label = c("John", "Kathy", 
"Timothy"), class = "factor"), patient.last.name = structure(c(3L, 
3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L), .Label = c("Jones", 
"Martinez", "Squeal"), class = "factor"), medical.record.nr = c(4563455, 
4563455, 4563455, 4563455, 4563455, 2663775, 2663775, 2663775, 
2663775, 2663775, 3330956, 3330956, 3330956, 3330956), date.of.service = c(39087, 
39112, 39112, 39130, 39228, 39234, 39244, 39244, 39262, 39360, 
39184, 39194, 39198, 39216), procedure.code = c(44750, 38995, 
40125, 44720, 44729, 44750, 38995, 40125, 44720, 44729, 44750, 
44729, 44729, 44729), diagnosis.code.1 = c(456.87, 456.87, 456.87, 
456.87, 456.87, 521.37, 521.37, 521.37, 521.37, 356.36, 456.87, 
456.87, 456.87, 456.87), diagnosis.code.2 = c(413, 413, 413, 
413, 413, 532.23, NA, NA, NA, NA, NA, NA, NA, NA), referring.doctor.first = structure(c(1L, 
1L, 1L, 1L, 1L, 2L, 2L, 2L, NA, NA, NA, 1L, 1L, NA), .Label = c("Abe", 
"Mark"), class = "factor"), referring.doctor.last = structure(c(1L, 
1L, 1L, 1L, 1L, 2L, 2L, 2L, NA, NA, NA, 1L, 1L, NA), .Label = c("Newstead", 
"Wydell"), class = "factor"), referring.docotor.zip = c(15209, 
15209, 15209, 15209, 15209, 15222, 15222, 15222, NA, NA, NA, 
15209, 15209, NA), some.other.stuff = structure(c(1L, 1L, 1L, 
NA, 3L, NA, NA, 4L, NA, 6L, NA, 2L, 5L, NA), .Label = c("alkjkdkdio", 
"cheerios", "ddddd", "dddddd", "dogs", "lkjljkkkkk"), class = "factor")), .Names = c("patient.first.name", 
"patient.last.name", "medical.record.nr", "date.of.service", 
"procedure.code", "diagnosis.code.1", "diagnosis.code.2", "referring.doctor.first", 
"referring.doctor.last", "referring.docotor.zip", "some.other.stuff"
), row.names = c(NA, 14L), class = "data.frame")

The obvious solution is to use some sort of last observation carried forward (LOCF) algorithm on referring.doctor.last and referring.doctor.first. However, it must stop when it gets to a new patient. In other words, the LOCF must only be applied within one patient, who is identified by the combination of patient.first.name, patient.last.name and medical.record.nr. Also note that some patients are missing the referring doctor on their very first visit, which means that some observations have to be carried backwards. To complicate matters, some patients change primary care physicians, so there may be one referring doctor earlier on and another one later on. The algorithm therefore needs to be aware of the date order of the rows with missing values.

In zoo's na.locf I do not see an easy way to group the LOCF per patient. The rolling join examples that I have seen would not work here because I cannot simply take out the rows with the missing referring.doctor information, since I would then lose date.of.service, procedure.code and so on. I would love your help in learning how R can fill in my missing data.

Solution

@MatthewDowle has provided us with a wonderful starting point and here we will take it to its conclusion.

In a nutshell, use zoo's na.locf. The problem is not amenable to rolling joins.

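library(data.table)
library(zoo)

setDT(bill)  # convert the data.frame to a data.table by reference
# Pass 1: carry the last observation forward within each patient
bill[, referring.doctor.last := na.locf(referring.doctor.last, na.rm = FALSE),
     by = list(patient.last.name, patient.first.name, medical.record.nr)]
# Pass 2: carry the next observation backward to fill any leading NA
bill[, referring.doctor.last := na.locf(referring.doctor.last, na.rm = FALSE, fromLast = TRUE),
     by = list(patient.last.name, patient.first.name, medical.record.nr)]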

Then do the same for referring.doctor.first.
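Both columns can also be filled in one call. What follows is only a minimal sketch, not code from the original answer. It assumes a small toy table in place of the real `bill`, groups by medical.record.nr alone for brevity, and sorts by date.of.service first because the question asks for the fill to respect date order. The names `fill_both_ways` and `fill_cols` are hypothetical helpers introduced here for illustration.

```r
library(data.table)
library(zoo)

# Toy stand-in for the billing table (values are illustrative only).
bill <- data.table(
  medical.record.nr      = c(1, 1, 1, 2, 2, 2),
  date.of.service        = c(39112, 39087, 39130, 39194, 39184, 39216),
  referring.doctor.first = c("Abe", NA, NA, "Mark", NA, NA),
  referring.doctor.last  = c("Newstead", NA, NA, "Wydell", NA, NA)
)

# Sort by patient, then by date, so "last observation" means "earlier visit".
setorder(bill, medical.record.nr, date.of.service)

# Hypothetical helper: forward fill, then backward fill whatever is still NA.
fill_both_ways <- function(x) {
  x <- na.locf(x, na.rm = FALSE)
  na.locf(x, na.rm = FALSE, fromLast = TRUE)
}

# Apply the helper to every column named in fill_cols, one patient at a time.
fill_cols <- c("referring.doctor.first", "referring.doctor.last")
bill[, (fill_cols) := lapply(.SD, fill_both_ways),
     by = medical.record.nr, .SDcols = fill_cols]
```

On the real data the `by` argument would be the full patient key used above (patient.last.name, patient.first.name, medical.record.nr).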

A few pointers:

  1. The by statement ensures that the last observation carried forward is restricted to the same patient so that the carrying does not "bleed" into the next patient on the list.

  2. One must use the na.rm=FALSE argument. Without it, a patient who is missing the referring physician on their very first visit will have that leading NA removed, and the vector of new values (existing + carried forward) will be one element shorter than the number of rows. The shortened vector is recycled, so everything gets shifted up and the last row gets the first element of the vector. In other words, a big mess. And worst of all, you will only see it sometimes.

  3. Use fromLast=TRUE to run through the column again. That fills in the NA that preceded any data. With fromLast=TRUE, zoo applies next observation carried backward (NOCB) instead of last observation carried forward (LOCF). Happiness - you have now filled in the missing data in a way that is correct for most circumstances.

  4. You can pass multiple := per line, e.g. DT[,`:=`(new=1L,new2=2L,...)]
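
The pointers about na.rm, fromLast and the functional form of := can be checked on a tiny example. This is a sketch under stated assumptions: the vector `x` and the table `DT` are made-up illustrations, not part of the billing data.

```r
library(data.table)
library(zoo)

# One patient's column with a leading NA, a gap, and a trailing NA.
x <- c(NA, "Wydell", NA, "Newstead", NA)

# na.rm defaults to TRUE, which drops the leading NA:
# the result has 4 elements for 5 rows.
short <- na.locf(x)

# na.rm = FALSE keeps the length; the leading NA stays for now.
forward <- na.locf(x, na.rm = FALSE)

# A second pass with fromLast = TRUE fills the leading NA from the next value.
both <- na.locf(forward, na.rm = FALSE, fromLast = TRUE)

# Functional form of := assigning two columns in one call.
DT <- data.table(id = 1:3)
DT[, `:=`(new = 1L, new2 = 2L)]
```

Comparing `length(short)` with `length(x)` shows why the default na.rm setting is unsafe inside a grouped assignment.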
