将范围元组列表折叠到重叠范围中 [英] Collapse a list of range tuples into the overlapping ranges

查看:26
本文介绍了将范围元组列表折叠到重叠范围中的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在寻找解决此问题的最节省内存的方法.

I'm looking for the most memory efficient way to solve this problem.

我有一个表示句子中部分字符串匹配的元组列表:

I have a list of tuples representing partial string matches in a sentence:

[(0, 2), (1, 2), (0, 4), (2,6), (23, 2), (22, 6), (26, 2), (26, 2), (26, 2)]

每个元组的第一个值是匹配的开始位置,第二个值是长度.

The first value of each tuple is the start position for the match, the second value is the length.

这个想法是折叠列表,以便只报告最长的继续字符串匹配.在这种情况下,它将是:

The idea is to collapse the list so that only the longest continue string match is reported. In this case it would be:

[(0,4), (2,6), (22,6)]

我不想要最长的范围,比如找到最长非重叠序列的算法,但我想要所有范围折叠最长.

I do not want just the longest range, like in algorithm to find longest non-overlapping sequences, but I want all the ranges collapsed by the longest.

如果您想知道,我正在使用 Aho-Corasick 的纯 Python 实现将静态字典中的术语与给定的文本片段进行匹配.

In case your wondering, I am using a pure python implementation of the Aho-Corasick for matching terms in a static dictionary to the given text snippet.

由于这些元组列表的性质,重叠但不是独立的范围应该单独打印出来.例如,字典中有 betazzeta 两个词,betazeta 的匹配项是 [(0,5),(4,8)].由于这些范围重叠,但不包含在另一个范围内,因此答案应该是 [(0,5),(4,8)].我还修改了上面的输入数据集,以便涵盖这种情况.

Due to the nature of these tuple lists, overlapping but not self-contained ranges should be printed out individually. For example, having the words betaz and zeta in the dictionary, the matches for betazeta are [(0,5),(4,8)]. Since these ranges overlap, but none is contained in the other, the answer should be [(0,5),(4,8)]. I have also modified the input dataset above so that this case is covered.

谢谢!

推荐答案

import operator
lst = [(0, 2), (1, 2), (0, 4), (2,6), (23, 2), (22, 6), (26, 2), (26, 2), (26, 2)]
lst.sort(key=operator.itemgetter(1))
for i in reversed(xrange(len(lst)-1)):
    start, length = lst[i]
    for j in xrange(i+1, len(lst)):
        lstart, llength = lst[j]
        if start >= lstart and start + length <= lstart + llength:
            del lst[i]
            break
print lst
#[(0, 4), (2, 6), (22, 6)]

这篇关于将范围元组列表折叠到重叠范围中的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆