合并两个重叠列表的Pythonic方法,保留顺序 [英] Pythonic way to merge two overlapping lists, preserving order

查看:77
本文介绍了合并两个重叠列表的Pythonic方法,保留顺序的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

好的,所以我有两个列表,例如:

Alright, so I have two lists, as such:

  • 它们可以并且将具有重叠的项目,例如,[1, 2, 3, 4, 5][4, 5, 6, 7].
  • 重叠中将有 not 个其他项目,例如,会出现 not 的情况:[1, 2, 3, 4, 5][3.5, 4, 5, 6, 7]
  • 列表不一定是有序的,也不是唯一的. [9, 1, 1, 8, 7][8, 6, 7].
  • They can and will have overlapping items, for example, [1, 2, 3, 4, 5], [4, 5, 6, 7].
  • There will not be additional items in the overlap, for example, this will not happen: [1, 2, 3, 4, 5], [3.5, 4, 5, 6, 7]
  • The lists are not necessarily ordered nor unique. [9, 1, 1, 8, 7], [8, 6, 7].

我想合并列表,以保留现有顺序,并合并到最后一个可能的有效位置,以确保没有数据丢失.此外,第一个列表可能很大.我当前的工作代码如下:

I want to merge the lists such that existing order is preserved, and to merge at the last possible valid position, and such that no data is lost. Additionally, the first list might be huge. My current working code is as such:

master = [1,3,9,8,3,4,5]
addition = [3,4,5,7,8]

def merge(master, addition):
    n = 1
    while n < len(master):
        if master[-n:] == addition[:n]:
            return master + addition[n:]
        n += 1
    return master + addition

我想知道的是-有更有效的方法吗?它可以工作,但是我对此持怀疑态度,因为它可以在我的应用程序中运行大型运行时-我正在合并大量字符串.

What I would like to know is - is there a more efficient way of doing this? It works, but I'm slightly leery of this, because it can run into large runtimes in my application - I'm merging large lists of strings.

我希望[1、3、9、8、3、4、5],[3、4、5、7、8]的合并为:[1、3、9、8 , 3,4,5 ,7,8].为了清楚起见,我突出显示了重叠部分.

I'd expect the merge of [1,3,9,8,3,4,5], [3,4,5,7,8] to be: [1,3,9,8,3,4,5,7,8]. For clarity, I've highlighted the overlapping portion.

[9、1、1、8、7],[8、6、7]应该合并为[9、1、1、8、7、8、6、7]

[9, 1, 1, 8, 7], [8, 6, 7] should merge to [9, 1, 1, 8, 7, 8, 6, 7]

推荐答案

这实际上并不是太困难.毕竟,基本上您要做的就是检查A末尾的哪个子字符串与B的哪个子字符串对齐.

This actually isn't too terribly difficult. After all, essentially all you're doing is checking what substring at the end of A lines up with what substring of B.

def merge(a, b):
    max_offset = len(b)  # can't overlap with greater size than len(b)
    for i in reversed(range(max_offset+1)):
        # checks for equivalence of decreasing sized slices
        if a[-i:] == b[:i]:
            break
    return a + b[i:]

我们可以通过以下方式对您的测试数据进行测试:

We can test with your test data by doing:

test_data = [{'a': [1,3,9,8,3,4,5], 'b': [3,4,5,7,8], 'result': [1,3,9,8,3,4,5,7,8]},
             {'a': [9, 1, 1, 8, 7], 'b': [8, 6, 7], 'result': [9, 1, 1, 8, 7, 8, 6, 7]}]

all(merge(test['a'], test['b']) == test['result'] for test in test_data)

这会遍历可能导致重叠的切片的所有可能组合,并且如果找到一个,则会记住重叠的结果.如果未找到任何内容,它将使用i的最后结果,该结果始终为0.无论哪种方式,它都会返回所有a以及b[i]之后的所有内容(在重叠情况下,这是非重叠部分.在非重叠情况下,则是所有内容)

This runs through every possible combination of slices that could result in an overlap and remembers the result of the overlap if one is found. If nothing is found, it uses the last result of i which will always be 0. Either way, it returns all of a plus everything past b[i] (in the overlap case, that's the non overlapping portion. In the non-overlap case, it's everything)

请注意,在极端情况下,我们可以进行一些优化.例如,这里最坏的情况是它遍历整个列表而没有找到任何解决方案.您可以在开始时添加快速检查,以免使最坏情况短路

Note that we can make a couple optimizations in corner cases. For instance, the worst case here is that it runs through the whole list without finding any solution. You could add a quick check at the beginning that might short circuit that worst case

def merge(a, b):
    if a[-1] not in b:
        return a + b
    ...

实际上,您可以使该解决方案更进一步,并且可能使您的算法更快

In fact you could take that solution one step further and probably make your algorithm much faster

def merge(a, b):
    while True:
        try:
            idx = b.index(a[-1]) + 1  # leftmost occurrence of a[-1] in b
        except ValueError:  # a[-1] not in b
            return a + b
        if a[-idx:] == b[:idx]:
            return a + b[:idx]

但是在以下情况下,这可能找不到最长的重叠时间:

However this might not find the longest overlap in cases like:

a = [1,2,3,4,1,2,3,4]
b = [3,4,1,2,3,4,5,6]
# result should be [1,2,3,4,1,2,3,4,5,6], but
# this algo produces [1,2,3,4,1,2,3,4,1,2,3,4,5,6]

您可以修复使用rindex而不是index来匹配最长片而不是最短片的问题,但是我不确定这对您的速度有什么影响.当然,它速度较慢,但​​可能无关紧要.您还可以记住结果并返回最短的结果,这可能是一个更好的主意.

You could fix that be using rindex instead of index to match the longest slice instead of the shortest, but I'm not sure what that does to your speed. It's certainly slower, but it might be inconsequential. You could also memoize the results and return the shortest result, which might be a better idea.

def merge(a, b):
    results = []
    while True:
        try:
            idx = b.index(a[-1]) + 1  # leftmost occurrence of a[-1] in b
        except ValueError:  # a[-1] not in b
            results.append(a + b)
            break
        if a[-idx:] == b[:idx]:
            results.append(a + b[:idx])
    return min(results, key=len)

应该这样做,因为合并最长的重叠应该在所有情况下产生最短的结果.

Which should work since merging the longest overlap should produce the shortest result in all cases.

这篇关于合并两个重叠列表的Pythonic方法,保留顺序的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆