Range overlap/intersect by group and between years
Question
I have a list of marked individuals (column Mark) that were captured in various years (column Year) within a range of the river (LocStart and LocEnd). Locations on the river are in meters.
I would like to know whether a marked individual has used overlapping ranges between years, i.e. whether the individual has returned to the same segment of the river from year to year.
Here is an example of the original data set:
ID   Mark   Year   LocStart   LocEnd
1    1081   1992   21,729     22,229
2    1081   1992   21,203     21,703
3    1081   2005   21,508     22,008
4    1126   1994   19,222     19,522
5    1126   1994   18,811     19,311
6    1283   2005   21,754     22,254
7    1283   2007   22,025     22,525
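For anyone who wants to reproduce this, the example data could be entered in R roughly as follows. This is only a sketch: the object name dt is chosen to match the name used in the answer, and the thousands separators are dropped on the assumption that LocStart and LocEnd are plain numeric positions in meters.

```r
# Example data from the question, re-entered by hand.
# Assumption: LocStart and LocEnd are numeric positions in meters,
# so the thousands separators shown in the table are removed.
dt <- data.frame(
  ID       = 1:7,
  Mark     = c(1081L, 1081L, 1081L, 1126L, 1126L, 1283L, 1283L),
  Year     = c(1992L, 1992L, 2005L, 1994L, 1994L, 2005L, 2007L),
  LocStart = c(21729L, 21203L, 21508L, 19222L, 18811L, 21754L, 22025L),
  LocEnd   = c(22229L, 21703L, 22008L, 19522L, 19311L, 22254L, 22525L)
)

# Quick look at the column types before doing any range work.
str(dt)
```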
Here is what I would like the final answer to look like:
Mark   Year1   Year2   IDs
1081   1992    2005    1, 3
1081   1992    2005    2, 3
1283   2005    2007    6, 7
In this case, individual 1126 would not appear in the final output, because its only two ranges are from the same year. I realize it would be easy to remove all records where Year1 = Year2.
I would like to do this in R and have looked into the IRanges package, but I have not been able to take the grouping by Mark into account while also extracting the Year1 and Year2 information.
Recommended answer
Using the foverlaps() function from the data.table package:
library(data.table)
## dt holds the data shown in the question, with numeric LocStart and LocEnd
setkey(setDT(dt), Mark, LocStart, LocEnd)            ## (1)
olaps = foverlaps(dt, dt, type="any", which=TRUE)    ## (2)
olaps = olaps[dt$Year[xid] != dt$Year[yid]]          ## (3)
olaps[, `:=`(Mark  = dt$Mark[xid],
             Year1 = dt$Year[xid],
             Year2 = dt$Year[yid],
             xid   = dt$ID[xid],
             yid   = dt$ID[yid])]                    ## (4)
olaps = olaps[xid < yid]                             ## (5)
olaps
#    xid yid Mark Year1 Year2
# 1:   2   3 1081  1992  2005
# 2:   1   3 1081  1992  2005
# 3:   6   7 1283  2005  2007
(1) We first convert the data.frame to a data.table by reference using setDT. Then we key the data.table on the columns Mark, LocStart and LocEnd, which allows us to perform overlapping range joins.
(2) We compute self overlaps (dt with itself) using any type of overlap, and return the matching row indices by setting which = TRUE. These indices refer to rows of the keyed (and therefore sorted) dt.
(3) Remove all index pairs where the Year values corresponding to xid and yid are identical.
(4) Add all the other columns and replace xid and yid with the corresponding ID values, by reference.
(5) Remove all pairs where xid >= yid. If row 1 overlaps with row 3, then row 3 also overlaps with row 1, and we don't need both. foverlaps() doesn't have a way to remove this by default yet.
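Putting the steps together, the result can be reshaped into the layout requested in the question. The following is a minimal sketch and not part of the original answer: the data are re-entered by hand with the thousands separators dropped, and both the pmin/pmax swap that puts the earlier year first and the pasted IDs column are assumptions about the desired output. The name res is introduced here for illustration.

```r
library(data.table)

# Example data re-entered by hand (assumption: numeric positions in meters).
dt <- data.table(
  ID       = 1:7,
  Mark     = c(1081L, 1081L, 1081L, 1126L, 1126L, 1283L, 1283L),
  Year     = c(1992L, 1992L, 2005L, 1994L, 1994L, 2005L, 2007L),
  LocStart = c(21729L, 21203L, 21508L, 19222L, 18811L, 21754L, 22025L),
  LocEnd   = c(22229L, 21703L, 22008L, 19522L, 19311L, 22254L, 22525L)
)

setkey(dt, Mark, LocStart, LocEnd)                      # (1) key for the range join
olaps <- foverlaps(dt, dt, type = "any", which = TRUE)  # (2) self-overlap row indices
olaps <- olaps[dt$Year[xid] != dt$Year[yid]]            # (3) keep different years only
olaps <- olaps[dt$ID[xid] < dt$ID[yid]]                 # (5) drop mirrored pairs

# Build the requested columns from the row indices.
# pmin/pmax put the earlier year in Year1 (an assumption about the
# desired output; the original answer keeps the years in xid/yid order).
res <- olaps[, .(
  Mark  = dt$Mark[xid],
  Year1 = pmin(dt$Year[xid], dt$Year[yid]),
  Year2 = pmax(dt$Year[xid], dt$Year[yid]),
  IDs   = paste(dt$ID[xid], dt$ID[yid], sep = ", ")
)]
print(res)
```

Individual 1126 drops out at step (3), since both of its ranges fall in 1994. Filtering on the ID values directly, rather than replacing xid and yid first, leaves the row indices intact for the final lookup.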