Identify consecutively overlapping segments in R


Problem description


I need to aggregate overlapping segments into a single segment spanning all connected segments.

Note that a simple foverlaps cannot detect connections between non-overlapping but connected segments; see the example for clarification. Put another way: if it rained on the segments in my plot, I would be looking for the stretches of ground that stay dry.

So far I have solved this with an iterative algorithm, but I'm wondering whether there is a more elegant and straightforward way. I'm surely not the first one to face it.

I was thinking about a non-equi rolling join, but failed to implement that.

library(data.table)
(x <- data.table(start = c(41,43,43,47,47,48,51,52,54,55,57,59), 
                  end = c(42,44,45,53,48,50,52,55,57,56,58,60)))

#     start end
#  1:    41  42
#  2:    43  44
#  3:    43  45
#  4:    47  53
#  5:    47  48
#  6:    48  50
#  7:    51  52
#  8:    52  55
#  9:    54  57
# 10:    55  56
# 11:    57  58
# 12:    59  60

setorder(x, start)[, i := .I] # i is just a helper for plotting segments
plot(NA, xlim = range(x[,.(start,end)]), ylim = rev(range(x$i)))
do.call(segments, list(x$start, x$i, x$end, x$i))

x$grp <- c(1,3,3,2,2,2,2,2,2,2,2,4) # the grouping I am looking for
do.call(segments, list(x$start, x$i, x$end, x$i, col = x$grp))
(y <- x[, .(start = min(start), end = max(end)), keyby = grp])

#    grp start end
# 1:   1    41  42
# 2:   2    47  58
# 3:   3    43  45
# 4:   4    59  60

do.call(segments, list(y$start, 12.2, y$end, 12.2, col = 1:4, lwd = 3))

EDIT:

That's brilliant, thanks. cummax & cumsum do the job; Uwe's answer is slightly better than David's comment.

  • end[.N] can give wrong results, because the last row of a group (ordered by start) does not necessarily have the largest end; try the example data x below. max(end) is correct in all cases, and faster.

    x <- data.table(start = c(11866, 12696, 13813, 14011, 14041), end = c(13140, 14045, 14051, 14039, 14045))

  • min(start) and start[1L] give the same result (as x is ordered by start); the latter is faster.
  • Computing grp on the fly is significantly faster; unfortunately I need grp assigned.
  • cumsum(cummax(shift(end, fill = 0)) < start) is significantly faster than cumsum(c(0, start[-1L] > cummax(head(end, -1L)))).
  • I did not test the solution based on the GenomicRanges package.
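The first bullet can be checked directly. The sketch below runs the grouping expression on the five-row example data and compares end[.N] with max(end); the column names end_last and end_max are made up for this illustration.

```r
library(data.table)

x <- data.table(start = c(11866, 12696, 13813, 14011, 14041),
                end   = c(13140, 14045, 14051, 14039, 14045))

# Group connected segments on the fly and take the group end in both ways.
# end_last / end_max are illustrative names, not from the original post.
res <- x[order(start, end),
         .(start    = start[1L],
           end_last = end[.N],    # end of the last row in the group
           end_max  = max(end)),  # largest end in the group
         by = .(grp = cumsum(cummax(shift(end, fill = 0)) < start))]
res
# All five segments are connected, so there is a single group.
# end_last is 14045, but the segment 13813-14051 reaches further: end_max is 14051.
```

The third segment ends at 14051 yet is not the last row after ordering by start, which is why end[.N] falls short here.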

Solution

The OP has asked to aggregate overlapping segments into a single segment spanning all connected segments.

Here is another solution which uses cummax() and cumsum() to identify groups of overlapping or adjacent segments:

x[order(start, end), grp := cumsum(cummax(shift(end, fill = 0)) < start)][
  , .(start = min(start), end = max(end)), by = grp]

   grp start end
1:   1    41  42
2:   2    43  45
3:   3    47  58
4:   4    59  60
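To see why the one-liner works, it can help to split it into its intermediate steps. The sketch below does that on the question's data; the helper columns prev_max and new_grp are made up for illustration and are not part of the original answer.

```r
library(data.table)

x <- data.table(start = c(41,43,43,47,47,48,51,52,54,55,57,59),
                end   = c(42,44,45,53,48,50,52,55,57,56,58,60))
setorder(x, start, end)

# Furthest end reached by any earlier segment (0 for the first row)
x[, prev_max := cummax(shift(end, fill = 0))]
# TRUE where the current segment starts beyond everything seen so far,
# i.e. where a gap opens and a new group begins
x[, new_grp := prev_max < start]
# Running count of gaps gives the group id
x[, grp := cumsum(new_grp)]
x
# grp: 1 2 2 3 3 3 3 3 3 3 3 4
```

Because the comparison is a strict <, segments that merely touch (one ends at 57, the next starts at 57) stay in the same group.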

Disclaimer: I have seen that clever approach somewhere else on SO but I cannot remember exactly where.

Edit:

As David Arenburg has pointed out, it is not necessary to create the grp variable separately. This can be done on-the-fly in the by = parameter:

x[order(start, end), .(start = min(start), end = max(end)), 
  by = .(grp = cumsum(cummax(shift(end, fill = 0)) < start))]
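One caveat that seems worth noting: fill = 0 works for the data shown because all coordinates are positive. If coordinates could be zero or negative, the padding value 0 may exceed the real ends and hide gaps. The sketch below wraps the same expression in a hypothetical helper, merge_segments(), and uses fill = -Inf instead. That substitution is an assumption on my part, not part of the original answer, and it presumes start and end are stored as doubles.

```r
library(data.table)

# Hypothetical helper (not from the original answer): merge connected segments.
# fill = -Inf is an assumed variant so that the first comparison is always TRUE,
# independent of the sign of the coordinates. Assumes double columns start/end.
merge_segments <- function(dt) {
  dt[order(start, end),
     .(start = start[1L], end = max(end)),
     by = .(grp = cumsum(cummax(shift(end, fill = -Inf)) < start))]
}

neg <- data.table(start = c(-10, -5, -4.5), end = c(-9, -4, -3))
res <- merge_segments(neg)
res
# Two groups: -10 to -9, and -5 to -3.
# With fill = 0 the same data would collapse into one group from -10 to -3.
```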

Visualisation

OP's plot can be amended to also show the aggregated segments (quick and dirty):

plot(NA, xlim = range(x[,.(start,end)]), ylim = rev(range(x$i)))
do.call(segments, list(x$start, x$i, x$end, x$i))
x[order(start, end), .(start = min(start), end = max(end)), 
  by = .(grp = cumsum(cummax(shift(end, fill = 0)) < start))][
    , segments(start, grp + 0.5, end, grp + 0.5, col = "red", lwd = 4)]
