Identify consecutively overlapping segments in R
Problem description
I need to aggregate overlapping segments into a single segment spanning all connected segments.
Note that a simple foverlaps cannot detect connections between segments that do not overlap each other but are connected through other segments; see the example for clarification. If it were raining on the segments in my plot, I would be looking for the stretches of dry ground.
So far I have solved this problem with an iterative algorithm, but I'm wondering whether there is a more elegant and straightforward way. I'm surely not the first one to face it.
I was thinking about a non-equi rolling join, but failed to implement it.
library(data.table)
(x <- data.table(start = c(41,43,43,47,47,48,51,52,54,55,57,59),
end = c(42,44,45,53,48,50,52,55,57,56,58,60)))
# start end
# 1: 41 42
# 2: 43 44
# 3: 43 45
# 4: 47 53
# 5: 47 48
# 6: 48 50
# 7: 51 52
# 8: 52 55
# 9: 54 57
# 10: 55 56
# 11: 57 58
# 12: 59 60
setorder(x, start)[, i := .I] # i is just a helper for plotting segments
plot(NA, xlim = range(x[,.(start,end)]), ylim = rev(range(x$i)))
do.call(segments, list(x$start, x$i, x$end, x$i))
x$grp <- c(1,3,3,2,2,2,2,2,2,2,2,4) # the grouping I am looking for
do.call(segments, list(x$start, x$i, x$end, x$i, col = x$grp))
(y <- x[, .(start = min(start), end = max(end)), keyby = grp]) # k= partially matched keyby; spelled out here
# grp start end
# 1: 1 41 42
# 2: 2 47 58
# 3: 3 43 45
# 4: 4 59 60
do.call(segments, list(y$start, 12.2, y$end, 12.2, col = 1:4, lwd = 3))
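The iterative algorithm mentioned above is not shown in the question. As a point of comparison only, here is a minimal loop-based sketch in base R of how such a merge could look. The function name merge_iterative is made up for illustration, the code assumes at least one segment, and it treats touching segments (end equal to the next start) as connected.

```r
# Hypothetical loop-based merge for comparison; this is NOT the OP's code.
merge_iterative <- function(start, end) {
  o <- order(start, end)
  start <- start[o]
  end <- end[o]
  out_start <- start[1L]
  out_end <- end[1L]
  for (k in seq_along(start)[-1L]) {
    n <- length(out_start)
    if (start[k] <= out_end[n]) {
      # segment touches or overlaps the current merged stretch: extend it
      out_end[n] <- max(out_end[n], end[k])
    } else {
      # there is a gap before this segment: open a new merged stretch
      out_start <- c(out_start, start[k])
      out_end <- c(out_end, end[k])
    }
  }
  data.frame(start = out_start, end = out_end)
}

start <- c(41, 43, 43, 47, 47, 48, 51, 52, 54, 55, 57, 59)
end   <- c(42, 44, 45, 53, 48, 50, 52, 55, 57, 56, 58, 60)
res <- merge_iterative(start, end)
print(res)
```

Growing out_start and out_end inside the loop is fine for a handful of segments but scales poorly, which is one reason a vectorised alternative is attractive.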
EDIT:
That's brilliant, thanks, cummax & cumsum do the job. Uwe's answer is slightly better than David's comment.
- end[.N] can give wrong results, try the example data x below. max(end) is correct in all cases, and faster.
  x <- data.table(start = c(11866, 12696, 13813, 14011, 14041), end = c(13140, 14045, 14051, 14039, 14045))
- min(start) and start[1L] give the same result (as x is ordered by start), the latter is faster.
- grp on the fly is significantly faster, unfortunately I need grp assigned.
- cumsum(cummax(shift(end, fill = 0)) < start) is significantly faster than cumsum(c(0, start[-1L] > cummax(head(end, -1L)))).
- I did not test the package GenomicRanges solution.
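The first and fourth points of the EDIT can be checked on the example data x given there. The sketch below uses base R only, so it runs without data.table: c(0, head(end, -1L)) stands in for shift(end, fill = 0), and taking the last element per group stands in for end[.N]. The variable names are chosen for illustration.

```r
# Example data from the EDIT, already ordered by start.
start <- c(11866, 12696, 13813, 14011, 14041)
end   <- c(13140, 14045, 14051, 14039, 14045)

# Previous row's end, with 0 for the first row (what shift(end, fill = 0) returns).
prev_end <- c(0, head(end, -1L))

# The two grouping expressions compared in the EDIT.
grp_a <- cumsum(cummax(prev_end) < start)
grp_b <- cumsum(c(0, start[-1L] > cummax(head(end, -1L))))

# The labels may differ by a constant offset, but the partition is the same.
same_partition <- all(grp_a - grp_a[1L] == grp_b - grp_b[1L])

# Last end per group (analogous to end[.N]) versus the maximum end per group.
last_end <- tapply(end, grp_a, function(e) e[length(e)])
max_end  <- tapply(end, grp_a, max)

print(same_partition)
print(c(last = unname(last_end), max = unname(max_end)))
```

All five segments fall into one group here, and the longest-reaching segment (ending at 14051) is not the last row, which is why taking the last end understates the merged segment while max(end) does not.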
The OP has requested to aggregate overlapping segments into a single segment spanning all connected segments.
Here is another solution which uses cummax() and cumsum() to identify groups of overlapping or adjacent segments:
x[order(start, end), grp := cumsum(cummax(shift(end, fill = 0)) < start)][
, .(start = min(start), end = max(end)), by = grp]
   grp start end
1:   1    41  42
2:   2    43  45
3:   3    47  58
4:   4    59  60
Disclaimer: I have seen that clever approach somewhere else on SO but I cannot remember exactly where.
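To see why the cummax()/cumsum() combination works, it can help to look at the intermediate vectors. The following is a base R walk-through on the OP's data, not the data.table code itself: c(0, head(e, -1L)) is used as a stand-in for shift(end, fill = 0), and the names reach and new_group are made up for illustration.

```r
start <- c(41, 43, 43, 47, 47, 48, 51, 52, 54, 55, 57, 59)
end   <- c(42, 44, 45, 53, 48, 50, 52, 55, 57, 56, 58, 60)

# Sort by start, then end, as x[order(start, end), ...] does.
o <- order(start, end)
s <- start[o]
e <- end[o]

prev_end  <- c(0, head(e, -1L))  # end of the previous row, 0 for the first row
reach     <- cummax(prev_end)    # furthest end seen among all earlier rows
new_group <- reach < s           # TRUE where a gap precedes the segment
grp       <- cumsum(new_group)   # running count of gaps = group id

steps <- data.frame(start = s, end = e, reach, new_group, grp)
print(steps)

# Collapse each group to a single segment.
merged <- data.frame(start = as.vector(tapply(s, grp, min)),
                     end   = as.vector(tapply(e, grp, max)))
print(merged)
```

Because the comparison is strict, a segment starting exactly where the running maximum ends (57 in this data) stays in the current group, which is how adjacent segments get merged. Seeding the first row with 0 presumes positive coordinates; with negative starts the seed would need to be smaller, for example -Inf.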
Edit:
As David Arenburg has pointed out, it is not necessary to create the grp variable separately. This can be done on-the-fly in the by = parameter:
x[order(start, end), .(start = min(start), end = max(end)),
by = .(grp = cumsum(cummax(shift(end, fill = 0)) < start))]
Visualisation
OP's plot can be amended to show also the aggregated segments (quick and dirty):
plot(NA, xlim = range(x[,.(start,end)]), ylim = rev(range(x$i)))
do.call(segments, list(x$start, x$i, x$end, x$i))
x[order(start, end), .(start = min(start), end = max(end)),
by = .(grp = cumsum(cummax(shift(end, fill = 0)) < start))][
, segments(start, grp + 0.5, end, grp + 0.5, "red", , 4)]