按类别填写数据表中的缺失(空白) - 向后和向前 [英] Filling in missing (blanks) in a data table, per category - backwards and forwards

查看:12
本文介绍了按类别填写数据表中的缺失(空白) - 向后和向前的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在处理我 11 年以上临床实践的大量账单记录数据集.相当多的行缺少转诊医生.但是,使用一些规则我可以​​很容易地填写它们,但不知道如何在 R 下的 data.table 中实现它.我知道 zoo 包中有 na.locf 之类的东西,data.table 包中的自滚动连接.我看到的例子太简单了,对我没有帮助.

I am working with a large data set of billing records for my clinical practice over 11 years. Quite a few of the rows are missing the referring physician. However, using some rules I can quite easily fill them in but do not know how to implement it in data.table under R. I know that there are things such as na.locf in the zoo package and self rolling join in the data.table package. The examples that I have seen are too simplistic and do not help me.

这里有一些虚构的数据来指导你(作为 dput ASCII 文本表示)

Here is some fictitious data to orient you (as a dput ASCII text representation)

    structure(list(patient.first.name = structure(c(1L, 1L, 1L, 1L, 
1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L), .Label = c("John", "Kathy", 
"Timothy"), class = "factor"), patient.last.name = structure(c(3L, 
3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L), .Label = c("Jones", 
"Martinez", "Squeal"), class = "factor"), medical.record.nr = c(4563455, 
4563455, 4563455, 4563455, 4563455, 2663775, 2663775, 2663775, 
2663775, 2663775, 3330956, 3330956, 3330956, 3330956), date.of.service = c(39087, 
39112, 39112, 39130, 39228, 39234, 39244, 39244, 39262, 39360, 
39184, 39194, 39198, 39216), procedure.code = c(44750, 38995, 
40125, 44720, 44729, 44750, 38995, 40125, 44720, 44729, 44750, 
44729, 44729, 44729), diagnosis.code.1 = c(456.87, 456.87, 456.87, 
456.87, 456.87, 521.37, 521.37, 521.37, 521.37, 356.36, 456.87, 
456.87, 456.87, 456.87), diagnosis.code.2 = c(413, 413, 413, 
413, 413, 532.23, NA, NA, NA, NA, NA, NA, NA, NA), referring.doctor.first = structure(c(1L, 
1L, 1L, 1L, 1L, 2L, 2L, 2L, NA, NA, NA, 1L, 1L, NA), .Label = c("Abe", 
"Mark"), class = "factor"), referring.doctor.last = structure(c(1L, 
1L, 1L, 1L, 1L, 2L, 2L, 2L, NA, NA, NA, 1L, 1L, NA), .Label = c("Newstead", 
"Wydell"), class = "factor"), referring.docotor.zip = c(15209, 
15209, 15209, 15209, 15209, 15222, 15222, 15222, NA, NA, NA, 
15209, 15209, NA), some.other.stuff = structure(c(1L, 1L, 1L, 
NA, 3L, NA, NA, 4L, NA, 6L, NA, 2L, 5L, NA), .Label = c("alkjkdkdio", 
"cheerios", "ddddd", "dddddd", "dogs", "lkjljkkkkk"), class = "factor")), .Names = c("patient.first.name", 
"patient.last.name", "medical.record.nr", "date.of.service", 
"procedure.code", "diagnosis.code.1", "diagnosis.code.2", "referring.doctor.first", 
"referring.doctor.last", "referring.docotor.zip", "some.other.stuff"
), row.names = c(NA, 14L), class = "data.frame")

显而易见的解决方案是在 refer.doctor.last 和 refer.doctor.first 上使用某种最后观察结转 (LOCF) 算法.但是,当它到达新患者时,它必须停止.换言之,LOCF 必须仅应用于由患者.first.name、患者.last.name、medical.record.nr 的组合标识的一位患者.另请注意,有些患者在第一次就诊时错过了转诊医生,这意味着必须向后进行一些观察.使事情复杂化的是,一些患者更换初级保健医生,因此可能会在更早的时候有一位转诊医生,而在以后可能会有另一位转诊医生.因此,该算法需要了解具有缺失值的行的日期顺序.

The obvious solution is to use some sort of last observation carried forward (LOCF) algorithm on referring.doctor.last and referring.doctor.first. However, it must stop when it gets to a new patient. In other words the LOCF must only be applied to one patient who is identified by the combination of patient.first.name, patient.last.name, medical.record.nr. Also note how some patients are missing the referring doctor on their very first visit so that means that some observations have to be carried backwards. To complicate matters some patients change primary care physicians and so there may be one referring doctor earlier on and another one later on. The alogorithm therefore needs to be aware of the date order of the rows with missing values.

在动物园 na.locf 我没有看到一种简单的方法来对每个患者的 LOCF 进行分组.我看到的滚动连接示例在这里不起作用,因为我不能简单地取出缺少 refering.doctor 信息的行,因为我会丢失 date.of.service 和 procedure.code 等.我希望您能帮助我了解 R 如何填补我缺失的数据.

In zoo na.locf I do not see an easy way to group the LOCF per patient. The rolling join examples that I have seen, would not work here becasuse I cannot simply take out the rows with the missing referring.doctor information since I would then loose date.of.service and procedure.code etcetera. I would love your help in learning how R can fill in my missing data.

推荐答案

@MatthewDowle 为我们提供了一个很好的起点,在这里我们将得出结论.

@MatthewDowle has provided us with a wonderful starting point and here we will take it to its conclusion.

简而言之,使用 zoona.locf.该问题不适用于滚动连接.

In a nutshell, use zoo's na.locf. The problem is not amenable to rolling joins.

setDT(bill)
bill[,referring.doctor.last:=na.locf(referring.doctor.last,na.rm=FALSE),
     by=list(patient.last.name, patient.first.name, medical.record.nr)]
bill[,referring.doctor.last:=na.locf(referring.doctor.last,na.rm=FALSE,fromLast=TRUE),
     by=list(patient.last.name, patient.first.name, medical.record.nr)]

然后为 referring.doctor.first

几点建议:

  1. by 语句确保最后的观察结果仅限于同一患者,因此该结果不会流血"到列表中的下一位患者.

  1. The by statement ensures that the last observation carried forward is restricted to the same patient so that the carrying does not "bleed" into the next patient on the list.

必须使用 na.rm=FALSE 参数.如果不这样做,那么在第一次就诊时缺少转诊医生信息的患者将删除 NA 并且新值的向量(现有 + 结转)将是一个缺少的元素行数.缩短的向量被回收,所有内容都向上移动,最后一行在回收时获取向量的第一个元素.换句话说,大乱.最糟糕的是,您有时只会看到它.

One must use the na.rm=FALSE argument. If one does not then a patient who is missing information for a referring physician on their very first visit will have the NA removed and the vector of new values (existing + carried forward) will be one element short of the number of rows. The shortened vector is recycled and everything gets shifted up and the last row gets the first element of the vector as it is recycled. In other words, a big mess. And worst of all you will only see it sometimes.

使用 fromLast=TRUE 再次遍历该列.这会填充任何数据之前的 NA.动物园使用下一个后向观察(NOCB)代替最后一次观察结转(LOCF).幸福——您现在已经以适合大多数情况的方式填写了缺失的数据.

Use fromLast=TRUE to run through the column again. That fills in the NA that preceded any data. Instead of last observation carried forward (LOCF) zoo uses next observation carried backward (NOCB). Happiness - you have now filled in the missing data in a way that is correct for most circumstances.

您可以在每行传递多个 :=,例如DT[,`:=`(new=1L,new2=2L,...)]

You can pass multiple := per line, e.g. DT[,`:=`(new=1L,new2=2L,...)]

这篇关于按类别填写数据表中的缺失(空白) - 向后和向前的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆