Identify consecutively overlapping segments in R

Views: 113 | Published: 2020/10/15 19:08:03 | Tags: r, data.table, grouping, overlap, locf

Problem description

I need to aggregate overlapping segments into a single segment spanning all connected segments.

Note that a simple foverlaps cannot detect connections between non-overlapping but connected segments; see the example for clarification. If it rained on the segments in my plot, I would be looking for the stretches of dry ground.

So far I have solved this problem with an iterative algorithm, but I'm wondering if there is a more elegant and straightforward way to do it. I'm surely not the first one to face it.

I was thinking about a non-equi rolling join, but failed to implement that.

library(data.table)
(x <- data.table(start = c(41,43,43,47,47,48,51,52,54,55,57,59), 
                  end = c(42,44,45,53,48,50,52,55,57,56,58,60)))

#     start end
#  1:    41  42
#  2:    43  44
#  3:    43  45
#  4:    47  53
#  5:    47  48
#  6:    48  50
#  7:    51  52
#  8:    52  55
#  9:    54  57
# 10:    55  56
# 11:    57  58
# 12:    59  60

setorder(x, start)[, i := .I] # i is just a helper for plotting segments
plot(NA, xlim = range(x[,.(start,end)]), ylim = rev(range(x$i)))
do.call(segments, list(x$start, x$i, x$end, x$i))

x$grp <- c(1,3,3,2,2,2,2,2,2,2,2,4) # the grouping I am looking for
do.call(segments, list(x$start, x$i, x$end, x$i, col = x$grp))
(y <- x[, .(start = min(start), end = max(end)), keyby = grp]) # keyby sorts the result by grp

#    grp start end
# 1:   1    41  42
# 2:   2    47  58
# 3:   3    43  45
# 4:   4    59  60

do.call(segments, list(y$start, 12.2, y$end, 12.2, col = 1:4, lwd = 3))
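The remark about foverlaps can be illustrated with a small sketch. It assumes the same twelve segments, stored here in a hypothetical table named segs so that x is left untouched, and uses foverlaps() with which = TRUE to list the pairs of rows that overlap directly. The row numbers refer to the keyed order (sorted by start, then end), which differs slightly from the order in the question.

```r
library(data.table)

# Same segments as in the question; `segs` is a name chosen for this sketch
segs <- data.table(start = c(41, 43, 43, 47, 47, 48, 51, 52, 54, 55, 57, 59),
                   end   = c(42, 44, 45, 53, 48, 50, 52, 55, 57, 56, 58, 60))
setkey(segs, start, end)  # foverlaps() needs the second table keyed on the interval columns

# All pairs (xid, yid) of rows whose intervals overlap directly
pairs <- foverlaps(segs, segs, type = "any", which = TRUE)

# In keyed order: row 5 is 47-53, row 8 is 52-55, row 9 is 54-57
pairs[xid == 5 & yid == 8]  # one row: 47-53 overlaps 52-55
pairs[xid == 8 & yid == 9]  # one row: 52-55 overlaps 54-57
pairs[xid == 5 & yid == 9]  # empty: 47-53 and 54-57 are linked only through 52-55
```

The pair list alone therefore still needs some form of transitive closure before it yields the desired groups.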

EDIT:

That's brilliant, thanks. cummax & cumsum do the job; Uwe's answer is slightly better than David's comment.

end[.N] can give wrong results; try the example data x below. max(end) is correct in all cases, and faster.

x <- data.table(start = c(11866, 12696, 13813, 14011, 14041), end = c(13140, 14045, 14051, 14039, 14045))

min(start) and start[1L] give the same result (as x is ordered by start); the latter is faster.
Computing grp on the fly is significantly faster; unfortunately, I need grp assigned.
cumsum(cummax(shift(end, fill = 0)) < start) is significantly faster than cumsum(c(0, start[-1L] > cummax(head(end, -1L)))).
I did not test the GenomicRanges package solution.
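The point about end[.N] can be checked directly on the five-row example from this edit. The sketch below is only an illustration: the table name x2 and the column names end_last and end_max are made up for the comparison.

```r
library(data.table)

# Example data from the edit above, already ordered by start
x2 <- data.table(start = c(11866, 12696, 13813, 14011, 14041),
                 end   = c(13140, 14045, 14051, 14039, 14045))

res <- x2[order(start, end),
          .(start    = start[1L],  # first start in the group (rows are sorted by start)
            end_last = end[.N],    # end of the last row in the group
            end_max  = max(end)),  # largest end in the group
          by = .(grp = cumsum(cummax(shift(end, fill = 0)) < start))]
print(res)
```

All five segments fall into one group. The last row ends at 14045, while the third row already reaches 14051, so end[.N] understates the extent of the merged segment and max(end) does not.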

Solution

The OP has requested to aggregate overlapping segments into a single segment spanning all connected segments.

Here is another solution which uses cummax() and cumsum() to identify groups of overlapping or adjacent segments:

x[order(start, end), grp := cumsum(cummax(shift(end, fill = 0)) < start)][
  , .(start = min(start), end = max(end)), by = grp]

   grp start end
1:   1    41  42
2:   2    43  45
3:   3    47  58
4:   4    59  60
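To see why the one-liner works, it may help to split it into its intermediate steps. The following is a sketch on the question's data; the column names prev_end, reach and new_grp are invented here for illustration and are not part of the original answer.

```r
library(data.table)

x <- data.table(start = c(41, 43, 43, 47, 47, 48, 51, 52, 54, 55, 57, 59),
                end   = c(42, 44, 45, 53, 48, 50, 52, 55, 57, 56, 58, 60))
setorder(x, start, end)

x[, prev_end := shift(end, fill = 0)]  # end of the previous row; 0 for the first row
x[, reach    := cummax(prev_end)]      # furthest end among all earlier rows
x[, new_grp  := reach < start]         # TRUE where a gap opens before this segment
x[, grp      := cumsum(new_grp)]       # running count of gaps = group id
print(x)
```

Two details follow from this breakdown. The comparison is strict, so a segment that starts exactly where the covered range ends (57 against a reach of 57) stays in the same group. And fill = 0 only yields TRUE for the first row when the first start is positive, which holds for this data.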

Disclaimer: I have seen that clever approach somewhere else on SO but I cannot remember exactly where.

Edit:

As David Arenburg has pointed out, it is not necessary to create the grp variable separately. This can be done on-the-fly in the by = parameter:

x[order(start, end), .(start = min(start), end = max(end)), 
  by = .(grp = cumsum(cummax(shift(end, fill = 0)) < start))]
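If the aggregation is needed in several places, the on-the-fly variant can be wrapped in a small function. The sketch below uses a hypothetical helper name, merge_segments, and swaps fill = 0 for fill = -Inf so that the first row always opens a group, even when starts are zero or negative. It assumes that start and end are double columns, since -Inf cannot be stored in an integer column.

```r
library(data.table)

# Hypothetical helper for illustration; not part of data.table.
# Assumes `dt` has double columns named `start` and `end`.
merge_segments <- function(dt) {
  dt[order(start, end),
     .(start = start[1L], end = max(end)),
     by = .(grp = cumsum(cummax(shift(end, fill = -Inf)) < start))]
}

x <- data.table(start = c(41, 43, 43, 47, 47, 48, 51, 52, 54, 55, 57, 59),
                end   = c(42, 44, 45, 53, 48, 50, 52, 55, 57, 56, 58, 60))
merged <- merge_segments(x)
print(merged)

# Three disjoint segments with negative coordinates (made-up data)
neg <- data.table(start = c(-5, -3, 2), end = c(-4, 1, 3))
merged_neg <- merge_segments(neg)
print(merged_neg)
```

With fill = 0 the first two rows of neg would be merged into one group although they do not touch, because 0 is not smaller than either start.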

Visualisation

OP's plot can be amended to show also the aggregated segments (quick and dirty):

plot(NA, xlim = range(x[,.(start,end)]), ylim = rev(range(x$i)))
do.call(segments, list(x$start, x$i, x$end, x$i))
x[order(start, end), .(start = min(start), end = max(end)), 
  by = .(grp = cumsum(cummax(shift(end, fill = 0)) < start))][
    , segments(start, grp + 0.5, end, grp + 0.5, col = "red", lwd = 4)]
