Range overlap/intersect by group and between years

Question

I have a list of marked individuals (column Mark) that were captured in various years (column Year) within a range of the river (columns LocStart and LocEnd). Locations on the river are in meters.

I would like to know whether a marked individual has used overlapping ranges between years, i.e. whether the individual has returned to the same segment of the river from year to year.

Here is an example of the original data set:

ID  Mark  Year  LocStart  LocEnd
1   1081  1992  21,729    22,229
2   1081  1992  21,203    21,703
3   1081  2005  21,508    22,008
4   1126  1994  19,222    19,522
5   1126  1994  18,811    19,311
6   1283  2005  21,754    22,254
7   1283  2007  22,025    22,525
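To make the example reproducible, the sample rows can be entered as a data frame. This is a minimal sketch: the name `dt` is chosen only to match the variable used in the answer further down, and the thousands separators are dropped so that the locations are plain integer meters.

```r
# Sample data from the question; LocStart and LocEnd are in meters.
dt <- data.frame(
  ID       = 1:7,
  Mark     = as.integer(c(1081, 1081, 1081, 1126, 1126, 1283, 1283)),
  Year     = as.integer(c(1992, 1992, 2005, 1994, 1994, 2005, 2007)),
  LocStart = as.integer(c(21729, 21203, 21508, 19222, 18811, 21754, 22025)),
  LocEnd   = as.integer(c(22229, 21703, 22008, 19522, 19311, 22254, 22525))
)

# Quick sanity check: every range must start before it ends.
stopifnot(all(dt$LocStart <= dt$LocEnd))
str(dt)
```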

Here is what I would like the final answer to look like:

Mark  Year1  Year2  IDs
1081  1992   2005   1, 3
1081  1992   2005   2, 3
1283  2005   2007   6, 7

In this case, individual 1126 would not be in the final output, as its only two ranges are from the same year. I realize it would be easy to remove all the records where Year1 = Year2.

I would like to do this in R and have looked into the IRanges package, but I have not been able to take the grouping by Mark into account while also extracting the Year1 and Year2 information.

Recommended answer

Using the foverlaps() function from the data.table package:

require(data.table)
setkey(setDT(dt), Mark, LocStart, LocEnd)               ## (1)
olaps = foverlaps(dt, dt, type="any", which=TRUE)       ## (2)
olaps = olaps[dt$Year[xid] != dt$Year[yid]]             ## (3)
olaps[, `:=`(Mark  = dt$Mark[xid], 
             Year1 = dt$Year[xid],
             Year2 = dt$Year[yid],
             xid   = dt$ID[xid], 
             yid   = dt$ID[yid])]                       ## (4)
olaps = olaps[xid < yid]                                ## (5)
#    xid yid Mark Year1 Year2
# 1:   2   3 1081  1992  2005
# 2:   1   3 1081  1992  2005
# 3:   6   7 1283  2005  2007

  1. We first convert the data.frame to a data.table by reference using setDT. Then we key the data.table on columns Mark, LocStart and LocEnd, which allows us to perform overlapping range joins.

  2. We calculate self overlaps (dt with itself) with any type of overlap, but return only the matching row indices by using which = TRUE.

  3. Remove all index pairs where the Year corresponding to xid and yid is identical.

  4. Add all the other columns and replace xid and yid with the corresponding ID values, by reference.

  5. Remove all rows where xid >= yid. If row 1 overlaps with row 3, then row 3 also overlaps with row 1, and we don't need both. foverlaps() doesn't have a way to remove this by default yet.
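As a cross-check that needs no extra packages, the same pairs can be found in base R with a self-merge on Mark followed by an explicit overlap test. This is a sketch of a different technique, not the answer's method: the names `dat`, `pairs`, `keep` and `res` are illustrative, the sample data is re-entered so the block runs on its own, and the ranges are treated as closed intervals (ranges that only touch at an endpoint count as overlapping), which is my understanding of how foverlaps() treats type="any". It also collapses the two IDs into the single IDs column asked for in the question.

```r
# Sample data from the question (locations in meters).
dat <- data.frame(
  ID       = 1:7,
  Mark     = c(1081, 1081, 1081, 1126, 1126, 1283, 1283),
  Year     = c(1992, 1992, 2005, 1994, 1994, 2005, 2007),
  LocStart = c(21729, 21203, 21508, 19222, 18811, 21754, 22025),
  LocEnd   = c(22229, 21703, 22008, 19522, 19311, 22254, 22525)
)

# Pair every capture with every capture of the same individual.
# Non-key columns get the suffixes "1" and "2" (ID1, Year1, ..., ID2, Year2, ...).
pairs <- merge(dat, dat, by = "Mark", suffixes = c("1", "2"))

# Keep each unordered pair once (ID1 < ID2), require different years,
# and require the two closed ranges to overlap.
keep <- with(pairs,
  ID1 < ID2 &
  Year1 != Year2 &
  LocStart1 <= LocEnd2 & LocStart2 <= LocEnd1
)
res <- pairs[keep, c("Mark", "Year1", "Year2", "ID1", "ID2")]

# Collapse the two IDs into one "IDs" column, as in the desired output.
res$IDs <- paste(res$ID1, res$ID2, sep = ", ")
res <- res[order(res$Mark, res$ID1), c("Mark", "Year1", "Year2", "IDs")]
rownames(res) <- NULL
print(res)
```

On the sample data this leaves the three rows requested in the question (IDs "1, 3", "2, 3" and "6, 7"). The self-merge builds every within-individual pair before filtering, so it grows quadratically with the number of captures per Mark; for large data sets the keyed foverlaps() join is likely the better choice. Note also that ordering pairs by ID only guarantees Year1 < Year2 when IDs increase with year, as they do here.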
