R: Using values from data frame A from a date prior to populate a row in data frame B
This may be very complicated, and I suspect it requires advanced knowledge. I now have two different types of data.frames that I need to combine:
The data:
Dataframe A:
lists all transfusion dates by patient ID. Every transfusion is represented by a separate row, patients can have multiple transfusions. Different patients can have transfusions on the same date.
Patient ID Transfusion.Date
1 01/01/2000
1 01/30/2000
2 04/01/2003
3 04/01/2003
Dataframes of Type B contain test results at other dates, also by patient ID:
Patient ID Test.Date Test.Value
1 11/30/1999 negative
1 01/15/2000 700 copies/uL
1 01/27/2000 900 copies/uL
2 03/30/2003 negative
What I would like to have is Dataframe A with the same number of rows (1 for each transfusion), and with the most recent Test.Value as a separate column. Each transfusion date should have the test result from the test performed most closely (prior) to the transfusion.
desired output:
-->
Patient ID Transfusion.Date Pre.Transfusion.Test
1 01/01/2000 negative
1 01/30/2000 900 copies/ul
2 04/01/2003 negative
3 04/01/2003 NA
I think the general strategy would be to subset the data.frames by patient IDs. Then take all transfusion dates for patient 1, check which result is closest to all available test_dates for each element and then return the value closest.
How can I tell R to do that?
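The strategy described above can be sketched directly in base R. This is only a minimal illustration, not the approach from the answer: it rebuilds the example data from the question, and it assumes that a test taken on the same day as the transfusion counts as "prior" (change <= to < if it should not). The result column name Pre.Transfusion.Test is taken from the desired output.

```r
# Example data, copied from the question.
df_A <- data.frame(
  MRN = c(1, 1, 2, 3),
  Transfusion.Date = as.Date(c("01/01/2000", "01/30/2000",
                               "04/01/2003", "04/01/2003"), "%m/%d/%Y")
)
df_B <- data.frame(
  MRN = c(1, 1, 1, 2),
  Test.Date = as.Date(c("11/30/1999", "01/15/2000", "01/27/2000",
                        "03/30/2003"), "%m/%d/%Y"),
  Test.Result = c("negative", "700 copies/ul", "900 copies/ul", "negative"),
  stringsAsFactors = FALSE
)

# For every transfusion row, look up the same patient's tests taken on or
# before the transfusion date and keep the result of the latest one.
df_A$Pre.Transfusion.Test <- vapply(seq_len(nrow(df_A)), function(i) {
  prior <- df_B[df_B$MRN == df_A$MRN[i] &
                  df_B$Test.Date <= df_A$Transfusion.Date[i], ]
  if (nrow(prior) == 0) return(NA_character_)  # no earlier test for this patient
  prior$Test.Result[which.max(as.numeric(prior$Test.Date))]
}, character(1))

print(df_A)
```

This scans df_B once per transfusion, which is fine for small data but slow for large tables.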
Edit 1: Here is the R-code for these examples
df_A <- data.frame(MRN = c(1,1,2,3),
Transfusion.Date = as.Date(c('01/01/2000', '01/30/2000',
'04/01/2003','04/01/2003'),'%m/%d/%Y'))
df_B <- data.frame(MRN = c(1,1,1,2),
Test.Date = as.Date(c('11/30/1999', '01/15/2000', '01/27/2000',
'03/30/2003'),'%m/%d/%Y'), Test.Result = c('negative',
'700 copies/ul','900 copies/ul','negative'))
Edit 2:
To clarify, the resulting data should read like this: Patient A received transfusions on day X and day Y (from df_A). Prior to the transfusion on day X, his most recent test result was the one with the closest test date before the first transfusion (in df_B). Prior to the transfusion on day Y, his most recent test result was the one closest before the second transfusion (also in df_B). df_B also contains a bunch of other test dates, which are not needed for the final output.
Here's a solution using data.table's rolling joins:
require(data.table)
setkey(setDT(df_A), MRN, Transfusion.Date)
setkey(setDT(df_B), MRN, Test.Date)
df_B[df_A, roll=TRUE]
# MRN Test.Date Test.Result
# 1: 1 2000-01-01 negative
# 2: 1 2000-01-30 900 copies/ul
# 3: 2 2003-04-01 negative
# 4: 3 2003-04-01 NA
setDT converts a data.frame to a data.table by reference (without any additional copying). That'll result in df_A and df_B now being data.tables.
setkey sorts the data.table by the columns we provided, and marks those columns as key columns, which allows us to use binary search based joins.
We perform a join of the form x[i] on the key columns, where for each row of i, the matching rows of x (if any, else NA) along with i's rows are returned. This is what we call an equi-join. By adding roll = TRUE, in the event of a mismatch, the last observation is carried forward (LOCF). This is what we call a rolling join. The sorting in increasing order (due to setkey()) ensures that the last observation is the most recent date.
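One thing to note about the output above is that the Test.Date column holds the transfusion dates from df_A, not the dates of the matched tests. The following is a hedged sketch of one way to keep both, assuming a data.table version that supports on= together with roll=. The column name Matched.Test.Date and the variable res are made up for this illustration.

```r
library(data.table)

# Example data from the question, built directly as data.tables.
df_A <- data.table(
  MRN = c(1, 1, 2, 3),
  Transfusion.Date = as.Date(c("01/01/2000", "01/30/2000",
                               "04/01/2003", "04/01/2003"), "%m/%d/%Y")
)
df_B <- data.table(
  MRN = c(1, 1, 1, 2),
  Test.Date = as.Date(c("11/30/1999", "01/15/2000", "01/27/2000",
                        "03/30/2003"), "%m/%d/%Y"),
  Test.Result = c("negative", "700 copies/ul", "900 copies/ul", "negative")
)

# Copy the test date first (hypothetical column name): after the join, the
# Test.Date column carries the transfusion date taken from df_A.
df_B[, Matched.Test.Date := Test.Date]

# Exact match on MRN, rolling match on the date (the last column in on=).
res <- df_B[df_A, on = .(MRN, Test.Date = Transfusion.Date), roll = TRUE]
setnames(res, "Test.Date", "Transfusion.Date")

print(res)
```

roll also accepts a positive number instead of TRUE, which limits how far back a test result may be carried forward (in days for Date columns).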
HTH