比较具有自定义功能的唯一对象列表 [英] Comparing list of unique objects with custom function

查看:64
本文介绍了比较具有自定义功能的唯一对象列表的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我需要比较数百个存储在唯一列表中的对象,以查找重复项:

I need to compare hundreds of objects stored in a unique list to find duplicates:

object_list = {Object_01, Object_02, Object_03, Object_04, Object_05, ...}

我编写了一个自定义函数,如果对象相等,则返回True,否则返回False:

I've written a custom function, which returns True, if the objects are equal and False if not:

object_01.compare(object_02)
>>> True

比较方法效果很好,但是每次执行花费很多时间.我目前正在使用itertools.combinations(x, 2)遍历所有组合.我认为使用dict来存储已比较的对象并动态创建新集是一个好主意,例如:

Compare method works well, but takes a lot of time per execution. I'm currently using itertools.combinations(x, 2) to iterate through all combinations. I've thought it's a good idea to use a dict for storing already compared objects and create new sets dynamically like:

dct = {'Compared': {}}
dct['Compared'] = set()

import itertools
for a, b in itertools.combinations(x, 2):

    if b.name not in dct['Compared']:

        if compare(a,b) == True:

            #print (a,b)
            key = a.name
            value = b.name

            if key not in dct:
                dct[key] = set()
                dct[key].add(value)
            else:
                dct[key].add(value)

            dct[key].add(key)

    dct['Compared'].add(b)

当前输出:

Compared: {'Object_02', 'Object_01', 'Object_03', 'Object_04', 'Object_05'}
Object_01: {'Object_02', 'Object_03', 'Object_01'}
Object_04: {'Object_05', 'Object_04'}
Object_05: {'Object_04'}
...

我想知道:是否有一种更快的方法来遍历所有组合以及如何中断/阻止已分配给对象的迭代重复列表?

I would like to know: Is there a faster way to iterate through all combinations and how to break/prevent the iteration of an object, which is already assigned to a list of duplicates?

所需的输出:

Compared: {'Object_02', 'Object_01', 'Object_03', 'Object_04', 'Object_05'}
Object_01: {'Object_02', 'Object_03', 'Object_01'}
Object_04: {'Object_05', 'Object_04'}
...

注意:比较方法是c-wrapper.要求是围绕它找到一种算法.

Note: Compare method is a c-wrapper. Requirement is to find an algorithm around it.

推荐答案

您不需要计算所有组合,只需检查给定的项目是否重复:

You don't need to calculate all combinations, you just need to check if a given item is a duplicate:

for i, a in enumerate(x):
    if any(a.compare(b) for b in x[:i]):
        # a is a duplicate of an already seen item, so do something

从技术上讲,这仍然是O(n ^ 2),但是您至少已经削减了所需的一半支票,并且应该快一些.

This is still technically O(n^2), but you've cut out at least half the checks required, and should be a bit faster.

简而言之,x[:i]返回索引i之前列表中的所有项目.如果项目x[i]出现在该列表中,则说明它是重复项.如果没有,列表中的之后可能会重复,但是到达那里时您会担心.

In short, x[:i] returns all items in the list before index i. If the item x[i] appears in that list, you know it's a duplicate. If not, there may be a duplicate after it in the list, but you worry about that when you get there.

在这里使用any也很重要:如果找到任何真实项目,它将立即停止,而无需检查其余可迭代项.

Using any is also important here: if it finds any true item, it will immediately stop, without checking the rest of the iterable.

您还可以通过从要检查的列表中删除已知的重复项来提高检查数量:

You could also improve the number of checks by removing known duplicates from the list you're checking against:

x_copy = x[:]
removed = 0
for i, a in enumerate(x):
    if any(a.compare(b) for b in x_copy[:i-removed]):
        del x_copy[i-removed]
        removed += 1
        # a is a duplicate of an already seen item, so do something

请注意,我们使用一个副本,以避免更改要迭代的序列,并且在使用索引时,我们需要考虑已删除的项目数.

Note that we use a copy, to avoid changing the sequence we're iterating over, and we need to take account for the number of items we've removed when using indexes.

接下来,我们只需要弄清楚如何构建字典即可.

Next, we just need to figure out how to build the dictionary.

这可能有点复杂.第一步是准确找出哪个元素是重复项.这可以通过实现any只是for循环的包装器来实现:

THis might be a little more complex. The first step is to figure out exactly which element is a duplicate. This can be done by realising any is just a wrapper around a for loop:

def any(iterable):
    for item in iterable:
        if item: return True
    return False    

然后我们可以进行一些小的更改,并传递一个函数:

We can then make a minor change, and pass in a function:

def first(iterable, fn):
    for item in iterable:
        if fn(item): return item     
    return None

现在,我们如下更改重复查找器:

Now, we change our duplicate finder as follows:

d = collections.defaultdict(list)

x_copy = x[:]
removed = 0
for i, a in enumerate(x):
    b = first(x_copy[:i-removed], a.compare):
    if b is not None:
        # b is the first occurring duplicate of a
        del x_copy[i-removed]
        removed += 1

        d[b.name].append(a)

     else:
         # we've not seen a yet, but might see it later
         d[a.name].append(a)

这会将列表中的每个元素放入dict(-like)中.如果您只想要重复项,那么这只是获取所有长度大于1的条目的一种情况.

This will put every element in the list into a dict(-like). If you only want the duplicates, it's then just a case of getting all the entries with a length greater than 1.

这篇关于比较具有自定义功能的唯一对象列表的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆