R merge two datasets based on specific columns with added condition


Problem description

Both Uwe's and GKi's answers are correct. GKi received the bounty because Uwe's answer came too late for it, but Uwe's solution runs about 15x as fast.

I have two datasets that contain scores for different patients at multiple measurement moments, like so:

df1 <- data.frame("ID" = c("patient1","patient1","patient1","patient1","patient2","patient3"),
                  "Days" = c(0,25,235,353,100,538),
                  "Score" = c(NA,2,3,4,5,6), 
                  stringsAsFactors = FALSE)
df2 <- data.frame("ID" = c("patient1","patient1","patient1","patient1","patient2","patient2","patient3"),
                  "Days" = c(0,25,248,353,100,150,503),
                  "Score" = c(1,10,3,4,5,7,6), 
                  stringsAsFactors = FALSE)
> df1
        ID Days Score
1 patient1    0    NA
2 patient1   25     2
3 patient1  235     3
4 patient1  353     4
5 patient2  100     5
6 patient3  538     6

> df2
        ID Days Score
1 patient1    0     1
2 patient1   25    10
3 patient1  248     3
4 patient1  353     4
5 patient2  100     5
6 patient2  150     7
7 patient3  503     6

Column ID shows the patient ID, column Days shows the moment of measurement (days since patient inclusion) and column Score shows the measured score. Both datasets show the same data but at different moments in time (df1 is from 2 years ago, df2 has the same data with updates from this year).

I have to compare the scores for each patient and each moment between both datasets. However, in some cases the Days variable has minor changes over time, so comparing the datasets by a simple join does not work. Example:

library(dplyr)

> full_join(df1, df2, by=c("ID","Days")) %>% 
+   arrange(.[[1]], as.numeric(.[[2]]))

        ID Days Score.x Score.y
1 patient1    0      NA       1
2 patient1   25       2      10
3 patient1  235       3      NA
4 patient1  248      NA       3
5 patient1  353       4       4
6 patient2  100       5       5
7 patient2  150      NA       7
8 patient3  503      NA       6
9 patient3  538       6      NA

Here, rows 3 and 4 contain data for the same measurement (with score 3) but are not joined because the values for the Days column are different (235 vs 248).

Question: I'm looking for a way to set a threshold on the second column (say 30 days) which would result in the following output:

> threshold <- 30
> *** insert join code ***

        ID Days Score.x Score.y
1 patient1    0      NA       1
2 patient1   25       2      10
3 patient1  248       3       3
4 patient1  353       4       4
5 patient2  100       5       5
6 patient2  150      NA       7
7 patient3  503      NA       6
8 patient3  538       6      NA

This output shows that rows 3 and 4 of the previous output have been merged (because 248-235 < 30) and have taken the value for Days of the second df (248).
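
The threshold test itself is only an absolute difference between two Days values. A minimal sketch in base R, using the patient1 days from the example above (the names days1, days2 and within are illustrative and not part of the original post):

```r
# Days recorded for patient1 in each dataset (taken from the example above)
days1 <- c(0, 25, 235, 353)
days2 <- c(0, 25, 248, 353)
threshold <- 30

# Pairwise absolute differences: rows index days1, columns index days2
within <- abs(outer(days1, days2, "-")) < threshold

# List the (row, col) index pairs that lie within the threshold
which(within, arr.ind = TRUE)
```

Besides the intended pair 235/248, this test also flags 0 against 25 and 25 against 0, so a plain pairwise threshold check matches too much on its own and additional rules are needed to decide which pairs are actually merged.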

Three main conditions to keep in mind are:

  • Consecutive days from the same df that lie within the threshold of each other (rows 1 and 2) are not merged.
  • In some cases, up to four values for the Days variable exist in the same dataframe and thus should not be merged. It might be the case that one of these values lies within the threshold of a value in the other dataframe, and these will have to be merged. See row 3 in the example below.
  • Each score/days/patient combination can only be used once. If a merge satisfies all conditions but there is still a double-merge possible, the first one should be used.

> df1
        ID Days Score
1 patient1    0     1
2 patient1    5     2
3 patient1   10     3
4 patient1   15     4
5 patient1   50     5

> df2
        ID Days Score
1 patient1    0     1
2 patient1    5     2
3 patient1   12     3
4 patient1   15     4
5 patient1   50     5

> df_combined
        ID Days Score.x Score.y
1 patient1    0       1       1
2 patient1    5       2       2
3 patient1   12       3       3
4 patient1   15       4       4
5 patient1   50       5       5

EDIT FOR CHINSOON12

> df1
          ID Days Score
 1: patient1    0     1
 2: patient1  116     2
 3: patient1  225     3
 4: patient1  309     4
 5: patient1  351     5
 6: patient2    0     6
 7: patient2   49     7
> df2
          ID Days Score
 1: patient1    0    11
 2: patient1   86    12
 3: patient1  195    13
 4: patient1  279    14
 5: patient1  315    15
 6: patient2    0    16
 7: patient2   91    17
 8: patient2  117    18

I wrapped your solution in a function like so:

testSO2 <- function(DT1,DT2) {
    setDT(DT1);setDT(DT2)
    names(DT1) <- c("ID","Days","X")
    names(DT2) <- c("ID","Days","Y")
    DT1$Days <- as.numeric(DT1$Days)
    DT2$Days <- as.numeric(DT2$Days)
    DT1[, c("s1", "e1", "s2", "e2") := .(Days - 30L, Days + 30L, Days, Days)]
    DT2[, c("s1", "e1", "s2", "e2") := .(Days, Days, Days - 30L, Days + 30L)]
    byk <- c("ID", "s1", "e1")
    setkeyv(DT1, byk)
    setkeyv(DT2, byk)
    o1 <- foverlaps(DT1, DT2)

    byk <- c("ID", "s2", "e2")
    setkeyv(DT1, byk)
    setkeyv(DT2, byk)
    o2 <- foverlaps(DT2, DT1)

    olaps <- funion(o1, setcolorder(o2, names(o1)))[
        is.na(Days), Days := i.Days]

    outcome <- olaps[, {
        if (all(!is.na(Days)) && any(Days == i.Days)) {
            s <- .SD[Days == i.Days, .(Days = Days[1L],
                                       X = X[1L],
                                       Y = Y[1L])]
        } else {
            s <- .SD[, .(Days = max(Days, i.Days), X, Y)]
        }
        unique(s)
    },
    keyby = .(ID, md = pmax(Days, i.Days))][, md := NULL][]
    return(outcome)
}

Which results in:

> testSO2(df1,df2)
          ID Days  X  Y
 1: patient1    0  1 11
 2: patient1  116  2 12
 3: patient1  225  3 13
 4: patient1  309  4 14
 5: patient1  315  4 15
 6: patient1  351  5 NA
 7: patient2    0  6 16
 8: patient2   49  7 NA
 9: patient2   91 NA 17
10: patient2  117 NA 18

As you can see, rows 4 and 5 are wrong. The value for Score in df1 is used twice (4). The correct output around those rows should be as follows, as each score (X or Y in this case) can only be used once:

          ID Days  X  Y
 4: patient1  309  4 14
 5: patient1  315 NA 15
 6: patient1  351  5 NA

Code for dataframes below.

> dput(df1)
structure(list(ID = c("patient1", "patient1", "patient1", "patient1", 
"patient1", "patient2", "patient2"), Days = c("0", "116", "225", 
"309", "351", "0", "49"), Score = 1:7), row.names = c(NA, 7L), class = "data.frame")
> dput(df2)
structure(list(ID = c("patient1", "patient1", "patient1", "patient1", 
"patient1", "patient2", "patient2", "patient2"), Days = c("0", 
"86", "195", "279", "315", "0", "91", "117"), Score = 11:18), row.names = c(NA, 
8L), class = "data.frame")

Solution

Being late to the party, here is a solution which uses a full outer join with subsequent grouping and aggregation of rows according to OP's rules.

library(data.table)
threshold <- 30
# full outer join
m <- merge(setDT(df1)[, o := 1L], setDT(df2)[, o := 2L], 
           by = c("ID", "Days"), all = TRUE)
# reorder rows
setorder(m, ID, Days)
# create grouping variable
m[, g := rleid(ID,
               cumsum(c(TRUE, diff(Days) > threshold)),
               !is.na(o.x) & !is.na(o.y),
               cumsum(c(TRUE, diff(fcoalesce(o.x, o.y)) == 0L))
)][, g := rleid(g, (rowid(g) - 1L) %/% 2)][]
# collapse rows where required
m[, .(ID = last(ID), Days = last(Days), 
      Score.x = last(na.omit(Score.x)), 
      Score.y = last(na.omit(Score.y)))
  , by = g][, g := NULL][]

For OP's first test case we get

         ID Days Score.x Score.y
1: patient1    0      NA       1
2: patient1   25       2      10
3: patient1  248       3       3
4: patient1  353       4       4
5: patient2  100       5       5
6: patient2  150      NA       7
7: patient3  503      NA       6
8: patient3  538       6      NA

as expected.
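
For repeated use, the steps above can be wrapped in a function. The following is only a sketch: the name merge_within_threshold and its signature are hypothetical, and it assumes both inputs have the columns ID, Days and Score with Days stored as numbers. It works on copies made by as.data.table(), so the helper column o is not added to the inputs.

```r
library(data.table)

# Hypothetical wrapper around the join / group / collapse steps shown above.
merge_within_threshold <- function(df1, df2, threshold = 30) {
  # Copies of the inputs, each tagged with its origin (1 = df1, 2 = df2)
  d1 <- as.data.table(df1)[, o := 1L]
  d2 <- as.data.table(df2)[, o := 2L]
  # Full outer join on exact ID and Days, then order the rows
  m <- merge(d1, d2, by = c("ID", "Days"), all = TRUE)
  setorder(m, ID, Days)
  # Grouping variable, same expressions as in the code above
  m[, g := rleid(ID,
                 cumsum(c(TRUE, diff(Days) > threshold)),
                 !is.na(o.x) & !is.na(o.y),
                 cumsum(c(TRUE, diff(fcoalesce(o.x, o.y)) == 0L)))]
  m[, g := rleid(g, (rowid(g) - 1L) %/% 2)]
  # Collapse each group to a single row
  m[, .(ID = last(ID), Days = last(Days),
        Score.x = last(na.omit(Score.x)),
        Score.y = last(na.omit(Score.y))),
    by = g][, g := NULL][]
}

# Usage with the OP's first test case
df1 <- data.frame(ID = c("patient1", "patient1", "patient1", "patient1", "patient2", "patient3"),
                  Days = c(0, 25, 235, 353, 100, 538),
                  Score = c(NA, 2, 3, 4, 5, 6),
                  stringsAsFactors = FALSE)
df2 <- data.frame(ID = c("patient1", "patient1", "patient1", "patient1", "patient2", "patient2", "patient3"),
                  Days = c(0, 25, 248, 353, 100, 150, 503),
                  Score = c(1, 10, 3, 4, 5, 7, 6),
                  stringsAsFactors = FALSE)
res <- merge_within_threshold(df1, df2, threshold = 30)
res
```

Passing the threshold as an argument makes it easier to rerun the same logic on other inputs or with a different tolerance.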

Verification with other use cases

With OP's 2nd test case

df1 <- data.table(ID = rep("patient1", 5L), Days = c(0, 5, 10, 15, 50), Score = 1:5)
df2 <- data.table(ID = rep("patient1", 5L), Days = c(0, 5, 12, 15, 50), Score = 1:5)

we get

         ID Days Score.x Score.y
1: patient1    0       1       1
2: patient1    5       2       2
3: patient1   12       3       3
4: patient1   15       4       4
5: patient1   50       5       5

With OP's 3rd test case (which was used to discuss chinsoon12's answer)

df1 <- data.table(ID = paste0("patient", c(rep(1, 5L), 2, 2)), 
                  Days = c(0, 116, 225, 309, 351, 0, 49), Score = 1:7)
df2 <- data.table(ID = paste0("patient", c(rep(1, 5L), 2, 2, 2)), 
                  Days = c(0, 86, 195, 279, 315, 0, 91, 117), Score = 11:18)

we get

          ID Days Score.x Score.y
 1: patient1    0       1      11
 2: patient1  116       2      12
 3: patient1  225       3      13
 4: patient1  309       4      14
 5: patient1  315      NA      15
 6: patient1  351       5      NA
 7: patient2    0       6      16
 8: patient2   49       7      NA
 9: patient2   91      NA      17
10: patient2  117      NA      18

as expected by the OP (see row 5 in particular).

Finally, my own test case has 5 "overlapping days" between 233 and 248 to verify that this case is handled as well

df1 <- data.table(ID = paste0("patient", c(rep(1, 6L), 2, 3)),
                  Days = c(0,1,25,235,237,353,100,538),
                  Score = c(NA, 2:8))
df2 <- data.table(ID = paste0("patient", c(rep(1, 6L), 2, 2, 3)),
                  Days = c(0, 25, 233, 234, 248, 353, 100, 150, 503),
                  Score = 11:19)

we get

          ID Days Score.x Score.y
 1: patient1    0      NA      11    # exact match
 2: patient1    1       2      NA    # overlapping, not collapsed
 3: patient1   25       3      12    # exact match
 4: patient1  233      NA      13    # overlapping, not collapsed
 5: patient1  235       4      14    # overlapping, collapsed
 6: patient1  248       5      15    # overlapping, collapsed
 7: patient1  353       6      16    # exact match
 8: patient2  100       7      17    # exact match
 9: patient2  150      NA      18    # not overlapping
10: patient3  503      NA      19    # not overlapping
11: patient3  538       8      NA    # not overlapping

Explanation

The full outer join merge(..., all = TRUE) finds exact matches on the same ID and day but includes all other rows from both datasets without matches.

Before joining, each dataset gets an additional column o to indicate the origin of each Score.

The result is ordered because the subsequent operations depend on the correct row order.

So, with my own test case we get

m <- merge(setDT(df1)[, o := 1L], setDT(df2)[, o := 2L], 
           by = c("ID", "Days"), all = TRUE)
setorder(m, ID, Days)[]

          ID Days Score.x o.x Score.y o.y
 1: patient1    0      NA   1      11   2
 2: patient1    1       2   1      NA  NA
 3: patient1   25       3   1      12   2
 4: patient1  233      NA  NA      13   2
 5: patient1  234      NA  NA      14   2
 6: patient1  235       4   1      NA  NA
 7: patient1  237       5   1      NA  NA
 8: patient1  248      NA  NA      15   2
 9: patient1  353       6   1      16   2
10: patient2  100       7   1      17   2
11: patient2  150      NA  NA      18   2
12: patient3  503      NA  NA      19   2
13: patient3  538       8   1      NA  NA
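
Row 1 of the table above shows why the origin column is useful: Score.x is NA there although df1 does have a row for that day. A small sketch of the difference, using two tiny illustrative tables (a and b are made-up names, the values are taken from the OP's first test case):

```r
library(data.table)

# df1-like table: the score at day 0 is missing, but the row exists
a <- data.table(ID = "patient1", Days = c(0, 235), Score = c(NA, 3))[, o := 1L]
# df2-like table
b <- data.table(ID = "patient1", Days = c(0, 248), Score = c(1, 3))[, o := 2L]

m <- merge(a, b, by = c("ID", "Days"), all = TRUE)

# A missing score cannot tell "no row in df1" apart from "row with NA score"
is.na(m$Score.x)
# The origin column is NA only where df1 has no row for that ID and day
is.na(m$o.x)
```

Only the origin columns reliably tell whether a joined row came from df1, df2, or both, which is what the grouping step relies on.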

Now, a grouping variable is created using rleid():

m[, g := rleid(ID,
               cumsum(c(TRUE, diff(Days) > threshold)),
               !is.na(o.x) & !is.na(o.y),
               cumsum(c(TRUE, diff(fcoalesce(o.x, o.y)) == 0L))
)][, g := rleid(g, (rowid(g) - 1L) %/% 2)][]

The group counter is advanced when one of the following conditions is met:

  • the ID changes,
  • within an ID, when there is a gap of more than 30 days between consecutive Days (so rows with a gap of 30 days or less within an ID belong to one group or are "overlapping"),
  • when a row is a direct match,
  • when consecutive rows have the same origin, thereby identifying streaks of rows of alternating origin, e.g., 1, 2, 1, 2, ... or 2, 1, 2, 1, ...
  • and, finally, within the above streaks, pairs of rows of alternating origin are counted, e.g., one row from df1 followed by a row from df2, or one row from df2 followed by a row from df1.

The last condition has not been explicitly stated by the OP but is my interpretation of

Each score/days/patient combination can only be used once. If a merge satisfies all conditions but there is still a double-merge possible, the first one should be used.

It ensures that at most two rows, one from each dataset, are collapsed.
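
The pairing step can be looked at in isolation. In this sketch, g0 is a made-up preliminary group id in which the first four rows form one streak of alternating origin; the expression is the same one used in the second rleid() call:

```r
library(data.table)

# Illustrative preliminary groups: a streak of four rows, then a single row
g0 <- c(1L, 1L, 1L, 1L, 2L)

# Position within each streak, turned into a pair counter: 0, 0, 1, 1, ...
pair <- (rowid(g0) - 1L) %/% 2

# A new group starts whenever the streak or the pair counter changes
g <- rleid(g0, pair)
g
```

The streak of four rows is split into two groups of two rows, so no group ever holds more than one pair.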

After grouping we get

          ID Days Score.x o.x Score.y o.y  g
 1: patient1    0      NA   1      11   2  1
 2: patient1    1       2   1      NA  NA  2
 3: patient1   25       3   1      12   2  3
 4: patient1  233      NA  NA      13   2  4
 5: patient1  234      NA  NA      14   2  5
 6: patient1  235       4   1      NA  NA  5
 7: patient1  237       5   1      NA  NA  6
 8: patient1  248      NA  NA      15   2  6
 9: patient1  353       6   1      16   2  7
10: patient2  100       7   1      17   2  8
11: patient2  150      NA  NA      18   2  9
12: patient3  503      NA  NA      19   2 10
13: patient3  538       8   1      NA  NA 11

Most of the groups contain only one row; a few contain 2 rows, which are collapsed in the final step (aggregate by group, return the desired columns, and remove the grouping variable g).
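
To see how each condition contributes, the arguments of rleid() can be stored as separate columns first. This sketch uses only the five patient1 rows between day 233 and day 248 from the test case; the column names gap, exact, streak and g0 are illustrative:

```r
library(data.table)

threshold <- 30
df1 <- data.table(ID = "patient1", Days = c(235, 237), Score = 4:5)
df2 <- data.table(ID = "patient1", Days = c(233, 234, 248), Score = 13:15)

m <- merge(df1[, o := 1L], df2[, o := 2L], by = c("ID", "Days"), all = TRUE)
setorder(m, ID, Days)

# Counter that advances on a gap of more than `threshold` days
m[, gap := cumsum(c(TRUE, diff(Days) > threshold))]
# TRUE where the row is an exact match present in both datasets
m[, exact := !is.na(o.x) & !is.na(o.y)]
# Counter that advances when two consecutive rows share the same origin
m[, streak := cumsum(c(TRUE, diff(fcoalesce(o.x, o.y)) == 0L))]
# Preliminary groups, then split into pairs of at most two rows
m[, g0 := rleid(ID, gap, exact, streak)]
m[, g := rleid(g0, (rowid(g0) - 1L) %/% 2)]
m[]
```

Here all rows fall into one gap group and none is an exact match, so the grouping is driven by the streak counter alone: 233 stays on its own, 234 is paired with 235, and 237 is paired with 248.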

Improved code

Aggregating by group requires that for each group only one value (vector of length 1) is returned for each column. (Otherwise, the group result would consist of multiple rows.) The implementation above uses last() on all 4 columns for the sake of simplicity.

last(Days) is equivalent to max(Days) because the dataset is ordered.

However, if I understand correctly, the OP prefers to return the Days value from df2 (although the OP has mentioned that max(Days) is acceptable as well).

In order to return the Days value from df2, the aggregation step needs to be modified: If the group size .N is larger than 1, we pick the Days value from the row which originates from df2, i.e. where o.y == 2.

# collapse rows where required
m[, .(ID = last(ID), 
      Days = last(if (.N > 1) Days[which(o.y == 2)] else Days), 
      Score.x = last(na.omit(Score.x)), 
      Score.y = last(na.omit(Score.y)))
  , by = g][, g := NULL][]

This will return

          ID Days Score.x Score.y
 1: patient1    0      NA      11
 2: patient1    1       2      NA
 3: patient1   25       3      12
 4: patient1  233      NA      13
 5: patient1  234       4      14
 6: patient1  248       5      15
 7: patient1  353       6      16
 8: patient2  100       7      17
 9: patient2  150      NA      18
10: patient3  503      NA      19
11: patient3  538       8      NA

Now the Days value 234 in collapsed row 5 has been picked from df2.

For the Score columns the use of last() should not matter at all, because there should be only one non-NA value in a group of 2 rows. So, na.omit() should return only a single value and last() is just for consistency, probably.
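
If a group contains no non-NA score at all, na.omit() leaves nothing for last() to pick. Should an explicit result be preferred in that case, a small helper can return a typed NA directly. This is only a sketch; last_non_na is a hypothetical name and not part of data.table or the code above:

```r
# Return the last non-missing element of x, or an NA of the same type as x
# when x has no non-missing element (hypothetical helper).
last_non_na <- function(x) {
  keep <- x[!is.na(x)]
  if (length(keep) > 0L) keep[length(keep)] else x[NA_integer_]
}

last_non_na(c(NA, 3, NA))
last_non_na(c(NA_integer_, NA_integer_))
```

In the aggregation step it could stand in for last(na.omit(Score.x)) and last(na.omit(Score.y)).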
