Merge overlapping numeric ranges into continuous ranges

Problem description

I am trying to merge a set of genomic coordinate ranges into continuous ranges, with an additional option for merging across gaps.

For example, given the genomic ranges [[0, 1000], [5, 1100]], I want the result to be [0, 1100]. If the offset option is set to 100 and the input is [[0, 1000], [1090, 1100]], I again want the result to be [0, 1100].

I have implemented an approach that steps through the alignments sequentially and tries to merge on the previous end position and the next start position, but it fails because the alignments have varying lengths. For example, my list, sorted by start position, contains [[138, 821], [177, 1158], [224, 905], [401, 1169]]. The answer should be [138, 1169], but I instead get [[138, 1158], [177, 905], [224, 1169]]. Obviously I need to take more into account than just the previous end and the next start, but I haven't found a good solution (preferably one that isn't a huge nest of if statements). Does anyone have any suggestions?

def overlap_alignments(align, gene, overlap):
    #make sure alignments are sorted first by chromosome then by start pos on chrom
    align = sorted(align, key = lambda x: (x[0], x[1]))
    merged = list()
    for i in xrange(1, len(align)):
        prv, nxt = align[i-1], align[i]
        if prv[0] == nxt[0] and prv[2] + overlap >= nxt[1]:
            start, end = prv[1], nxt[2]
            chrom = prv[0]
            merged.append([chrom, start, end, gene])
    return merged


Recommended answer

Well, how about keeping track of every start and end, and of the number of ranges that each position belongs to?

def overlap_alignments(align, overlap):
    # create a list of starts and ends
    stends = [ (a[0], 1) for a in align ]
    stends += [ (a[1] + overlap, -1) for a in align ]
    stends.sort(key=lambda x: x[0])

    # now we should have a list of starts and ends ordered by position,
    # e.g. if the ranges are 5..10, 8..15, and 12..13, we have
    # (5,1), (8,1), (10,-1), (12,1), (13,-1), (15,-1)

    # next, we form a cumulative sum of this
    s = 0
    cs = []
    for se in stends:
        s += se[1]
        cs.append((se[0], s))
    # this is, with the numbers above, (5,1), (8,2), (10,1), (12,2), (13,1), (15,0)
    # so, 5..8 belongs to one range, 8..10 belongs to two overlapping ranges,
    # 10..12 belongs to one range, etc

    # now we'll find all contiguous ranges
    # when we traverse through the list of depths (number of overlapping ranges), a new
    # range starts when the earlier number of overlapping ranges has been 0
    # a range ends when the new number of overlapping ranges is zero 
    prevdepth = 0
    start = 0
    combined = []
    for pos, depth in cs:
        if prevdepth == 0:
            start = pos
        elif depth == 0:
            combined.append((start, pos-overlap))
        prevdepth = depth

    return combined

This would be easier to draw than to explain. (And yes, the cumulative sum could be written more compactly, but I find it clearer this way.)
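As a rough illustration of the shorter form mentioned in the parenthetical, the running depth can be computed with `itertools.accumulate`. This is only a sketch: the name `merge_ranges` is made up for the example, and the explicit tie-break in the sort key (starts before ends at the same position) is an assumption added here so that touching ranges merge without relying on sort stability.

```python
from itertools import accumulate

def merge_ranges(ranges, overlap=0):
    """Merge [start, end] pairs whose gap is at most `overlap`."""
    # +1 marks a start, -1 marks an end padded by `overlap`.
    events = [(start, 1) for start, _ in ranges]
    events += [(end + overlap, -1) for _, end in ranges]
    # Sort by position; on ties, starts (+1) come before ends (-1).
    events.sort(key=lambda e: (e[0], -e[1]))

    # Running depth after each event = number of currently open ranges.
    depths = list(accumulate(delta for _, delta in events))

    merged = []
    prev_depth = 0
    start = None
    for (pos, _), depth in zip(events, depths):
        if prev_depth == 0:
            start = pos  # depth rises from zero: a merged range opens
        elif depth == 0:
            # depth falls back to zero: close the range, undoing the padding
            merged.append((start, pos - overlap))
        prev_depth = depth
    return merged

print(merge_ranges([[138, 821], [177, 1158], [224, 905], [401, 1169]]))
```

With the sorted list from the question, this should collapse the four alignments into the single range (138, 1169).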

To explain this graphically, let's take the input ([5,10],[8,15],[12,13],[16,20]) and overlap=1.

.....XXXXXo.............. (5-10)
........XXXXXXXo......... (8-15)
............Xo........... (12-13)
................XXXXo.... (16-20)
.....1112221221111111.... number of ranges at each position
.....----------------.... number of ranges > 0
.....---------------..... overlap corrected (5-20)
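The records in the question also carry a chromosome, which the depth-counting code above does not look at. One possible way to handle that, shown here as a sketch of a different technique (a single sorted pass that remembers the furthest end seen so far, rather than counting depth), is below. The function name `merge_alignments` and the (chrom, start, end) tuple layout are assumptions for the example, not part of the original answer.

```python
def merge_alignments(alignments, overlap=0):
    """Merge (chrom, start, end) records that overlap on the same chromosome.

    Records whose gap is at most `overlap` are merged as well.
    """
    merged = []
    # Sorting tuples orders by chromosome first, then start, then end.
    for chrom, start, end in sorted(alignments):
        last = merged[-1] if merged else None
        if last and last[0] == chrom and start <= last[2] + overlap:
            # Extend the open range. max() matters: a short alignment nested
            # inside a long earlier one must not shrink the merged end.
            last[2] = max(last[2], end)
        else:
            merged.append([chrom, start, end])
    return merged

rows = [("chr1", 138, 821), ("chr1", 177, 1158),
        ("chr1", 224, 905), ("chr1", 401, 1169)]
print(merge_alignments(rows))
```

Taking the maximum of the ends is what addresses the failure described in the question, where comparing only adjacent pairs let the shorter [224, 905] alignment cut the merged range short.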
