R: Using values from data frame A from a date prior to populate a row in data frame B


Problem description

This may be very complicated, and I suspect it requires advanced knowledge. I now have two different types of data.frames that I need to combine:

The data:

Dataframe A lists all transfusion dates by patient ID. Every transfusion is represented by a separate row, and patients can have multiple transfusions. Different patients can have transfusions on the same date.

      Patient ID Transfusion.Date
      1          01/01/2000
      1          01/30/2000
      2          04/01/2003
      3          04/01/2003
      

Dataframes of type B contain test results from other dates, also by patient ID:

      Patient ID  Test.Date   Test.Value
      1           11/30/1999   negative
      1           01/15/2000   700 copies/uL
      1           01/27/2000   900 copies/uL
      2           03/30/2003   negative
      

What I would like to have is Dataframe A with the same number of rows (one per transfusion) and with the most recent Test.Value as a separate column. Each transfusion date should carry the result of the test performed closest to, but before, the transfusion.

Desired output:

      Patient ID Transfusion.Date Pre.Transfusion.Test
      1          01/01/2000       negative
      1          01/30/2000       900 copies/ul
      2          04/01/2003       negative
      3          04/01/2003       NA
      

I think the general strategy would be to subset the data.frames by patient ID, then take all transfusion dates for patient 1, find for each one the closest earlier date among that patient's available test dates, and return the corresponding value.

How can I tell R to do that?

Edit 1: Here is the R code for these examples:

      df_A <- data.frame(MRN = c(1,1,2,3), 
                         Transfusion.Date = as.Date(c('01/01/2000', '01/30/2000', 
                         '04/01/2003','04/01/2003'),'%m/%d/%Y')) 
      
      df_B <- data.frame(MRN = c(1,1,1,2), 
                         Test.Date = as.Date(c('11/30/1999', '01/15/2000', '01/27/2000', 
                         '03/30/2003'),'%m/%d/%Y'), Test.Result = c('negative', 
                         '700 copies/ul','900 copies/ul','negative'))
      

Edit 2:

To clarify, the resulting data should read: Patient A received transfusions on day X and day Y (from df_A). Prior to the transfusion on day X, his most recent test result was X (the test date closest to the first transfusion, in df_B). Prior to the transfusion on day Y, his most recent test result was Y (the last test before the second transfusion, also in df_B). df_B also contains a bunch of other test dates, which are not needed for the final output.

Solution

Here's a solution using data.table's rolling joins:

      require(data.table)
      setkey(setDT(df_A), MRN, Transfusion.Date)
      setkey(setDT(df_B), MRN, Test.Date)
      
      df_B[df_A, roll=TRUE]
      #    MRN  Test.Date   Test.Result
      # 1:   1 2000-01-01      negative
      # 2:   1 2000-01-30 900 copies/ul
      # 3:   2 2003-04-01      negative
      # 4:   3 2003-04-01            NA
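
One thing to watch in the output above: the `Test.Date` column holds the transfusion dates from df_A, not the dates of the matched tests. The following is a minimal sketch, not part of the original answer, of one way to keep both dates. It assumes a data.table version that supports the `on=` argument, rebuilds the example data with ISO dates, and introduces a hypothetical helper column `Last.Test.Date`.

```r
library(data.table)

# Example data from the question, rebuilt with ISO dates
df_A <- data.table(MRN = c(1, 1, 2, 3),
                   Transfusion.Date = as.Date(c("2000-01-01", "2000-01-30",
                                                "2003-04-01", "2003-04-01")))
df_B <- data.table(MRN = c(1, 1, 1, 2),
                   Test.Date = as.Date(c("1999-11-30", "2000-01-15",
                                         "2000-01-27", "2003-03-30")),
                   Test.Result = c("negative", "700 copies/ul",
                                   "900 copies/ul", "negative"))

# Hypothetical helper column: a copy of the test date that is not a join column,
# so it is carried into the result unchanged
df_B[, Last.Test.Date := Test.Date]

# Rolling join on MRN and date; the join column takes its values from df_A
res <- df_B[df_A, on = .(MRN, Test.Date = Transfusion.Date), roll = TRUE]

# The join column now holds transfusion dates, so rename it accordingly
setnames(res, "Test.Date", "Transfusion.Date")
print(res)
```

Because a rolling join tries an exact match first, a test taken on the transfusion date itself would also be returned. If only strictly earlier tests should count, that case needs separate handling. According to the data.table documentation, a finite positive `roll` value limits how far a result is carried forward, which may be useful when very old tests should not count.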
      

• setDT converts a data.frame to a data.table by reference (without any additional copying). As a result, df_A and df_B are now data.tables.

• setkey sorts the data.table by the columns we provide and marks those columns as key columns, which allows us to use binary-search-based joins.

• We perform a join of the form x[i] on the key columns, where for each row of i, the matching rows of x (if any, else NA) along with i's rows are returned. This is what we call an equi-join. By adding roll = TRUE, in the event of a mismatch, the last observation is carried forward (LOCF). This is what we call a rolling join. The sorting in increasing order (due to setkey()) ensures that the last observation is the most recent date.
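
To make the difference between the two joins concrete, here is a small sketch (not from the original answer) that rebuilds the example data as data.tables and runs the same keyed join with and without `roll = TRUE`. The names `equi` and `rolled` are illustrative only.

```r
library(data.table)

df_A <- data.table(MRN = c(1, 1, 2, 3),
                   Transfusion.Date = as.Date(c("2000-01-01", "2000-01-30",
                                                "2003-04-01", "2003-04-01")))
df_B <- data.table(MRN = c(1, 1, 1, 2),
                   Test.Date = as.Date(c("1999-11-30", "2000-01-15",
                                         "2000-01-27", "2003-03-30")),
                   Test.Result = c("negative", "700 copies/ul",
                                   "900 copies/ul", "negative"))

# Sort both tables and mark the join columns as keys
setkey(df_A, MRN, Transfusion.Date)
setkey(df_B, MRN, Test.Date)
print(key(df_B))

# Equi-join: only exact (MRN, date) matches are filled in
equi <- df_B[df_A]

# Rolling join: on a mismatch, the last earlier test for the same MRN is used
rolled <- df_B[df_A, roll = TRUE]

print(equi)
print(rolled)
```

In this data no test date coincides with a transfusion date, so the plain equi-join should return NA for every row, while the rolling join fills in the last earlier result for patients 1 and 2.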

HTH
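
For comparison, the strategy outlined in the question (look at one patient at a time and pick the latest test on or before each transfusion) can also be sketched in base R. This is an illustrative alternative, not the answer's method; it rebuilds the example data with ISO dates, and it compares every transfusion against every test, so it is likely to be much slower than a rolling join on large data.

```r
df_A <- data.frame(MRN = c(1, 1, 2, 3),
                   Transfusion.Date = as.Date(c("2000-01-01", "2000-01-30",
                                                "2003-04-01", "2003-04-01")))
df_B <- data.frame(MRN = c(1, 1, 1, 2),
                   Test.Date = as.Date(c("1999-11-30", "2000-01-15",
                                         "2000-01-27", "2003-03-30")),
                   Test.Result = c("negative", "700 copies/ul",
                                   "900 copies/ul", "negative"),
                   stringsAsFactors = FALSE)

# For each transfusion row, keep the same patient's tests dated on or before
# the transfusion and take the result with the latest test date
df_A$Pre.Transfusion.Test <- vapply(seq_len(nrow(df_A)), function(i) {
  prior <- df_B[df_B$MRN == df_A$MRN[i] &
                  df_B$Test.Date <= df_A$Transfusion.Date[i], ]
  if (nrow(prior) == 0) {
    NA_character_  # no earlier test for this patient
  } else {
    prior$Test.Result[which.max(as.numeric(prior$Test.Date))]
  }
}, character(1))

print(df_A)
```

Changing `<=` to `<` in the filter would restrict the match to tests taken strictly before the transfusion date.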
