Find matching intervals in data frame by range of two column values

Problem description

I have a data frame of time related events.

Here is an example:

Name     Event Order     Sequence     start_event     end_event     duration     Group 
JOHN     1               A               0               19          19           ID1
JOHN     2               A               60              112         52           ID1  
JOHN     3               A               392             429         37           ID1  
JOHN     4               B               282             329         47           ID1
JOHN     5               C               147             226         79           ID1  
JOHN     6               C               566             611         45           ID1  
ADAM     1               A               19              75          56           ID2
ADAM     2               A               384             407         23           ID2  
ADAM     3               B               0               79          79           ID2  
ADAM     4               B               505             586         81           ID2
ADAM     5               C               140             205         65           ID2  
ADAM     6               C               522             599         77           ID2  
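
For anyone who wants to run code against this sample, the table above can be rebuilt roughly as follows. This is only a sketch: the object name `dat` and the column name `EventOrder` (without the space) are assumptions borrowed from the answer's output, and `duration` is recomputed from the start/end columns rather than typed in.

```r
# Rebuild the sample table as a data frame (names `dat` and `EventOrder`
# are assumptions taken from the answer's code and printed output).
start_event <- c(0, 60, 392, 282, 147, 566, 19, 384, 0, 505, 140, 522)
end_event   <- c(19, 112, 429, 329, 226, 611, 75, 407, 79, 586, 205, 599)

dat <- data.frame(
  Name        = rep(c("JOHN", "ADAM"), each = 6),
  EventOrder  = rep(1:6, times = 2),
  Sequence    = c("A", "A", "A", "B", "C", "C",   # JOHN
                  "A", "A", "B", "B", "C", "C"),  # ADAM
  start_event = start_event,
  end_event   = end_event,
  duration    = end_event - start_event,
  Group       = rep(c("ID1", "ID2"), each = 6),
  stringsAsFactors = FALSE
)

print(dat)
```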

There are essentially two different groups, ID1 and ID2. Each group has 18 different names. Each person appears in 3 different sequences, A-C, and has active time periods during those sequences; I mark the start/end events and calculate the duration.

I'd like to isolate each person and find when they have matching time intervals with people in both the opposite and same group ID.

Using the example data above, I want to find when John and Adam appear during the same sequence, at the same time. I then want to compare John to the rest of the 17 names in ID1/ID2.

I do not need to match the exact amount of shared 'active' time; I'm just hoping to isolate the rows that are common.

I'm most comfortable using dplyr, but I can't crack this yet. I looked around and saw some similar examples with adjacency matrices, but those work with exact data points. I can't figure out the strategy for a range/interval.

Thank you!

UPDATE: Here is the example of the desired result

  Name     Event Order     Sequence     start_event     end_event     duration     Group 
    JOHN     3               A               392             429         37           ID1        
    JOHN     5               C               147             226         79           ID1  
    JOHN     6               C               566             611         45           ID1  
    ADAM     2               A               384             407         23           ID2  
    ADAM     5               C               140             205         65           ID2  
    ADAM     6               C               522             599         77           ID2  

I'm thinking you'd isolate each event row for John, mark the start/end time frame and then iterate through every name and event for the remainder of the data frame to find time points that fit first within the same sequence, and then secondly against the bench-marked start/end time frame of John.

Solution

As I understand it, you want to return any row where an event for John with a particular sequence number overlaps an event for anybody else with the same sequence value. To achieve this, you could use split-apply-combine to split by sequence, identify the overlapping rows, and then re-combine:

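# TRUE when the open intervals (start1, end1) and (start2, end2) intersect
overlap <- function(start1, end1, start2, end2) pmin(end1, end2) > pmax(start2, start1)
do.call(rbind, lapply(split(dat, dat$Sequence), function(x) {
  # Row positions of John's events and of everyone else's within this sequence
  jpos <- which(x$Name == "JOHN")
  njpos <- which(x$Name != "JOHN")
  # Logical matrix: rows are John's events, columns are the other events
  over <- outer(jpos, njpos, function(a, b) {
    overlap(x$start_event[a], x$end_event[a], x$start_event[b], x$end_event[b])
  })
  # Keep every event that overlaps at least one event on the other side
  x[c(jpos[rowSums(over) > 0], njpos[colSums(over) > 0]),]
}))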
#      Name EventOrder Sequence start_event end_event duration Group
# A.2  JOHN          2        A          60       112       52   ID1
# A.3  JOHN          3        A         392       429       37   ID1
# A.7  ADAM          1        A          19        75       56   ID2
# A.8  ADAM          2        A         384       407       23   ID2
# C.5  JOHN          5        C         147       226       79   ID1
# C.6  JOHN          6        C         566       611       45   ID1
# C.11 ADAM          5        C         140       205       65   ID2
# C.12 ADAM          6        C         522       599       77   ID2

Note that my output includes two additional rows that are not shown in the question: John's sequence A event over the time range [60, 112], and Adam's sequence A event over [19, 75], which overlap each other.
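
To see why those rows are picked up while John's first event is not, it may help to call the predicate directly on a few intervals from the sample data. This is just an illustrative sketch; the key point it shows is that the comparison is strict, so intervals that only touch at an endpoint are not treated as overlapping.

```r
# Same predicate as in the answer: two intervals overlap when the smaller
# of the two end points lies strictly after the larger of the two starts.
overlap <- function(start1, end1, start2, end2) pmin(end1, end2) > pmax(start2, start1)

# John [60, 112] vs Adam [19, 75]: they share [60, 75]
overlap(60, 112, 19, 75)    # TRUE

# John [392, 429] vs Adam [384, 407]: they share [392, 407]
overlap(392, 429, 384, 407) # TRUE

# John [0, 19] vs Adam [19, 75]: they only touch at 19, so strict ">" says no
overlap(0, 19, 19, 75)      # FALSE
```

If touching intervals should count as a match, replacing `>` with `>=` in `overlap` would be one way to do it.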

This could be pretty easily mapped into dplyr language:

library(dplyr)
overlap <- function(start1, end1, start2, end2) pmin(end1, end2) > pmax(start2, start1)
sliceRows <- function(name, start, end) {
  jpos <- which(name == "JOHN")
  njpos <- which(name != "JOHN")
  over <- outer(jpos, njpos, function(a, b) overlap(start[a], end[a], start[b], end[b]))
  c(jpos[rowSums(over) > 0], njpos[colSums(over) > 0])
}
dat %>%
  group_by(Sequence) %>%
  slice(sliceRows(Name, start_event, end_event))
# Source: local data frame [8 x 7]
# Groups: Sequence [3]
# 
#     Name EventOrder Sequence start_event end_event duration  Group
#   (fctr)      (int)   (fctr)       (int)     (int)    (int) (fctr)
# 1   JOHN          2        A          60       112       52    ID1
# 2   JOHN          3        A         392       429       37    ID1
# 3   ADAM          1        A          19        75       56    ID2
# 4   ADAM          2        A         384       407       23    ID2
# 5   JOHN          5        C         147       226       79    ID1
# 6   JOHN          6        C         566       611       45    ID1
# 7   ADAM          5        C         140       205       65    ID2
# 8   ADAM          6        C         522       599       77    ID2

If you wanted to be able to compute the overlaps for a specified pair of users, this could be done by wrapping the operation into a function that specifies the pair of users to be processed:

overlap <- function(start1, end1, start2, end2) pmin(end1, end2) > pmax(start2, start1)
pair.overlap <- function(dat, user1, user2) {
  dat <- dat[dat$Name %in% c(user1, user2),]
  do.call(rbind, lapply(split(dat, dat$Sequence), function(x) {
    jpos <- which(x$Name == user1)
    njpos <- which(x$Name == user2)
    over <- outer(jpos, njpos, function(a, b) {
      overlap(x$start_event[a], x$end_event[a], x$start_event[b], x$end_event[b])
    })
    x[c(jpos[rowSums(over) > 0], njpos[colSums(over) > 0]),]
  }))
}

You could use pair.overlap(dat, "JOHN", "ADAM") to get the previous output. Generating the overlaps for every pair of users can now be done with combn and apply:

apply(combn(unique(as.character(dat$Name)), 2), 2, function(x) pair.overlap(dat, x[1], x[2]))
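
Because the per-pair results generally have different numbers of rows, it can be convenient to keep them in a named list. The following is a minimal sketch of one way to do that, assuming the sample data is rebuilt as `dat` (an assumed reconstruction of the question's table) and using `combn(..., simplify = FALSE)` with `lapply`; the "USER1-USER2" naming scheme is purely illustrative.

```r
# Sample data from the question (reconstructed; `dat` is an assumed name)
dat <- data.frame(
  Name        = rep(c("JOHN", "ADAM"), each = 6),
  Sequence    = c("A", "A", "A", "B", "C", "C", "A", "A", "B", "B", "C", "C"),
  start_event = c(0, 60, 392, 282, 147, 566, 19, 384, 0, 505, 140, 522),
  end_event   = c(19, 112, 429, 329, 226, 611, 75, 407, 79, 586, 205, 599),
  stringsAsFactors = FALSE
)

overlap <- function(start1, end1, start2, end2) pmin(end1, end2) > pmax(start2, start1)

# Same logic as pair.overlap in the answer
pair.overlap <- function(dat, user1, user2) {
  dat <- dat[dat$Name %in% c(user1, user2), ]
  do.call(rbind, lapply(split(dat, dat$Sequence), function(x) {
    jpos  <- which(x$Name == user1)
    njpos <- which(x$Name == user2)
    over  <- outer(jpos, njpos, function(a, b) {
      overlap(x$start_event[a], x$end_event[a], x$start_event[b], x$end_event[b])
    })
    x[c(jpos[rowSums(over) > 0], njpos[colSums(over) > 0]), ]
  }))
}

# One character vector of length 2 per pair of users
pairs <- combn(unique(as.character(dat$Name)), 2, simplify = FALSE)

# One data frame of overlapping rows per pair, labelled "USER1-USER2"
res <- lapply(pairs, function(p) pair.overlap(dat, p[1], p[2]))
names(res) <- vapply(pairs, paste, character(1), collapse = "-")

names(res)
res[["JOHN-ADAM"]]
```

With only two users in the sample there is a single pair; with the full set of names the list would hold one entry per pair.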
