How to efficiently find the indices of matching elements in two lists

Question

I am working on two large data sets, and my question is as follows.

Suppose I have two lists:

list1 = [A,B,C,D]

list2 = [B,D,A,G]

How can I efficiently find the matching indices in Python, without an O(n^2) search? The result should look like:

matching_index(list1,list2) -> [(0,2),(1,0),(3,1)]

Recommended answer

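Without duplicates

If your objects are hashable and your lists have no duplicates, you can create an inverse index of the first list and then traverse the second list. This traverses each list only once and is therefore O(n).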

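def find_matching_index(list1, list2):
    # Map each element of list1 to its index (assumes hashable, unique elements)
    inverse_index = {element: index for index, element in enumerate(list1)}

    # Pair each list1 index with the index of the matching element in list2
    return [(inverse_index[element], index)
            for index, element in enumerate(list2) if element in inverse_index]

find_matching_index([1, 2, 3], [3, 2, 1]) # [(2, 0), (1, 1), (0, 2)]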
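
As a quick check against the example from the question, here is a minimal sketch of the same inverse-index idea. The name `match_indices` is hypothetical, and it indexes the second list rather than the first. That is an assumption made here so that the pairs come out as (index in list1, index in list2) and in list1 order, as the question shows.

```python
def match_indices(list1, list2):
    # Hypothetical helper: map each element of list2 to its position.
    # Assumes hashable elements and no duplicates within either list.
    position_in_list2 = {element: index for index, element in enumerate(list2)}

    # One pass over list1; each dict lookup is O(1) on average.
    return [(index, position_in_list2[element])
            for index, element in enumerate(list1)
            if element in position_in_list2]


print(match_indices(["A", "B", "C", "D"], ["B", "D", "A", "G"]))
# [(0, 2), (1, 0), (3, 1)]
```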

With duplicates

You can extend the previous solution to account for duplicates. You can keep track of multiple indices with a set.

def find_matching_index(list1, list2):
    # Create an inverse index whose values are now sets of indices
    inverse_index = {}

    for index, element in enumerate(list1):
        if element not in inverse_index:
            inverse_index[element] = {index}
        else:
            inverse_index[element].add(index)

    # Traverse the second list
    matching_index = []

    for index, element in enumerate(list2):
        # Create one pair per index stored in the set for this element
        if element in inverse_index:
            matching_index.extend((x, index) for x in inverse_index[element])

    return matching_index

find_matching_index([1, 1, 2], [2, 2, 1]) # [(2, 0), (2, 1), (0, 2), (1, 2)]

Unfortunately, this is no longer O(n). Consider the inputs [1, 1] and [1, 1]: the output is [(0, 0), (1, 0), (0, 1), (1, 1)]. Since the output itself can contain n^2 pairs, the worst case cannot be better than O(n^2).

This solution is still O(n), however, when there are no duplicates.
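
To see how quickly the output grows with duplicates, the number of pairs can be counted without building them. This is only an illustrative sketch: `count_matching_pairs` is a hypothetical helper, and it assumes hashable elements.

```python
from collections import Counter


def count_matching_pairs(list1, list2):
    # Hypothetical helper: count occurrences of each element in both lists.
    counts1 = Counter(list1)
    counts2 = Counter(list2)

    # An element occurring a times in list1 and b times in list2
    # contributes a * b index pairs to the output.
    return sum(count * counts2[element] for element, count in counts1.items())


print(count_matching_pairs([1, 1, 2], [2, 2, 1]))    # 4
print(count_matching_pairs([1] * 1000, [1] * 1000))  # 1000000
```

With n identical elements in each list the count is n * n, which is why no algorithm that lists every pair can beat O(n^2) in that case.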

Now comes the case where your objects are not hashable but are comparable. The idea is to sort both lists in a way that preserves the original index of each element. We can then group runs of equal elements to get the matching indices.
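
The sort-then-group step can be sketched on a single list first. This is a small illustration under the assumption that the elements support ordering comparisons (Python lists do); the variable names are made up for the example.

```python
from itertools import groupby

data = [[], [1, 2], []]  # lists are comparable but not hashable

# Pair each element with its index, then sort so equal elements are adjacent.
pairs = sorted((element, index) for index, element in enumerate(data))

# Group by element and keep only the original indices of each group.
grouped = [(element, [index for _, index in group])
           for element, group in groupby(pairs, key=lambda pair: pair[0])]

print(grouped)  # [([], [0, 2]), ([1, 2], [1])]
```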

Since we make heavy use of groupby and product in the following code, I made find_matching_index return a generator for memory efficiency on long lists.

from itertools import groupby, product

def find_matching_index(list1, list2):
    # Sort (element, index) pairs so equal elements become adjacent
    # while each one keeps its original index
    sorted_list1 = sorted((element, index) for index, element in enumerate(list1))
    sorted_list2 = sorted((element, index) for index, element in enumerate(list2))

    list1_groups = groupby(sorted_list1, key=lambda pair: pair[0])
    list2_groups = groupby(sorted_list2, key=lambda pair: pair[0])

    # Current group of list2; group2 is None once list2 is exhausted
    element2, group2 = next(list2_groups, (None, None))

    for element1, group1 in list1_groups:
        # Advance list2 until its current element is no longer smaller
        while group2 is not None and element2 < element1:
            element2, group2 = next(list2_groups, (None, None))

        if group2 is None:
            break

        # element1 has no match in list2; keep group2 for the next element
        if element1 < element2:
            continue

        # In versions prior to 3.3, loop over the product and yield each pair
        yield from product((i for _, i in group1), (i for _, i in group2))

list1 = [[], [1, 2], []]
list2 = [[1, 2], []]

list(find_matching_index(list1, list2)) # [(0, 1), (2, 1), (1, 0)]

It turns out that time complexity does not suffer that much. Sorting of course takes O(n log(n)), but groupby then provides generators that recover all elements in a single pass over each of the two sorted lists. The conclusion is that the complexity is primarily bound by the size of the output of product. This gives a best case of O(n log(n)) and a worst case that is once again O(n^2).
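
When trying any of these variants, a brute-force reference can help confirm the results. The sketch below is the O(n^2) search the question wants to avoid, so it is only meant for checking small inputs. The name `brute_force_matching_index` is hypothetical. It needs nothing but `==`, so it also works for unhashable elements.

```python
def brute_force_matching_index(list1, list2):
    # Compare every element of list1 with every element of list2.
    return [(i, j)
            for i, element1 in enumerate(list1)
            for j, element2 in enumerate(list2)
            if element1 == element2]


expected = brute_force_matching_index([[], [1, 2], []], [[1, 2], []])
print(expected)  # [(0, 1), (1, 0), (2, 1)]
```

The faster versions may emit the same pairs in a different order, so it is safer to compare sorted results than raw lists.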
