R rolling join two data.tables with error margin on join

Question

Note: this question is a copy of this one but with different wording, and a suggestion for data.table instead of dplyr

I have two datasets that contain scores for different patients on multiple measuring moments like so:

dt1 <- data.frame("ID" = c("patient1","patient1","patient1","patient1","patient2","patient3"),
                  "Days" = c(0,10,25,340,100,538),
                  "Score" = c(NA,2,3,99,5,6), 
                  stringsAsFactors = FALSE)
dt2 <- data.frame("ID" = c("patient1","patient1","patient1","patient1","patient2","patient2","patient3"),
                  "Days" = c(0,10,25,353,100,150,503),
                  "Score" = c(1,10,3,4,5,7,6), 
                  stringsAsFactors = FALSE)

> dt1
        ID Days Score
1 patient1    0    NA
2 patient1   10     2
3 patient1   25     3
4 patient1  340    99
5 patient2  100     5
6 patient3  538     6

> dt2
        ID Days Score
1 patient1    0     1
2 patient1   10    10
3 patient1   25     3
4 patient1  353     4
5 patient2  100     5
6 patient2  150     7
7 patient3  503     6

Column Days is the time measurement. I want to join both datasets based on ID and Days if the value for Days is within threshold <- 30. There are five conditions:


  • Consecutive days that are within the threshold from within the same df (rows 1 and 2) are not merged.
  • In some cases, up to four values for the Days variable exist in the same dataframe and thus should not be merged. It might be the case that one of these values does exist within the threshold in the other dataframe, and these will have to be merged (row 4).
  • Data that does not fall within the threshold should not be merged, but should not be discarded either (see example output rows 7 and 8).
  • If there is no corresponding value for Days in either of the data sets, NA should be filled in.
  • The dataframes are not of equal length!

I suspect a data.table rolling join can give me the answer but I can't seem to figure it out. The expected output is as follows:

setDT(dt1)
setDT(dt2)
setkey(dt1, ID, Days)  # ?
setkey(dt2, ID, Days)  # ?

# ** do the join **

> dt_joined

        ID Days Score.x Score.y
1 patient1    0      NA       1
2 patient1   10       2      10
3 patient1   25       3       3
4 patient1  353      99       4   <<- merged (days 340 > 353)
5 patient2  100       5       5
6 patient2  150      NA       7   <<- new row added in dt2
7 patient3  503      NA       6   
8 patient3  538       6      NA   <<- same score as row 7 but not within threshold


Any help would be greatly appreciated. A data.table solution is not mandatory.

Recommended answer

User Uwe has already given a data.table answer here:

https://stackoverflow.com/a/62321710/12079387
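
The linked answer is not reproduced on this page. As a minimal sketch of one possible approach (not necessarily the one used in the linked answer), a nearest rolling join can pair each dt2 row with the closest dt1 row of the same ID. Matches farther apart than the threshold are then blanked, and unused dt1 rows are appended. The helper names x, y, m, used, unmatched and res, and the columns Days.x and Days.y, are illustrative assumptions and do not appear in the question.

```r
library(data.table)

threshold <- 30

dt1 <- data.table(ID = c("patient1","patient1","patient1","patient1","patient2","patient3"),
                  Days = c(0,10,25,340,100,538),
                  Score = c(NA,2,3,99,5,6))
dt2 <- data.table(ID = c("patient1","patient1","patient1","patient1","patient2","patient2","patient3"),
                  Days = c(0,10,25,353,100,150,503),
                  Score = c(1,10,3,4,5,7,6))

# Copy Days under a second name in each table, because the join column
# in the result only carries the values from the table used as `i` (dt2).
x <- dt1[, .(ID, Days, Days.x = Days, Score.x = Score)]
y <- dt2[, .(ID, Days, Days.y = Days, Score.y = Score)]

# For every dt2 row, look up the nearest dt1 row with the same ID.
m <- x[y, on = .(ID, Days), roll = "nearest"]

# A nearest match can be arbitrarily far away, so blank out the dt1 side
# of every pair whose distance exceeds the threshold.
m[abs(Days.x - Days.y) > threshold, `:=`(Days.x = NA_real_, Score.x = NA_real_)]

# dt1 rows that did not end up in any accepted pair are kept as rows of their own.
used <- m[!is.na(Days.x), .(ID, Days.x)]
unmatched <- x[!used, on = .(ID, Days.x)]

res <- rbind(m[, .(ID, Days, Score.x, Score.y)],
             unmatched[, .(ID, Days, Score.x)],
             fill = TRUE)   # Score.y is filled with NA for the appended rows
setorder(res, ID, Days)
print(res)
```

Merged rows keep the Days value from dt2, which is what the expected output shows for row 4 (353 rather than 340). One limitation of this sketch: if several dt2 rows are all nearest to the same dt1 row and within the threshold, the dt1 score is repeated on each of them, so data with that pattern would need an extra de-duplication step.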
