比较具有自定义功能的唯一对象列表 [英] Comparing list of unique objects with custom function
问题描述
我需要比较数百个存储在唯一列表中的对象,以查找重复项:
I need to compare hundreds of objects stored in a unique list to find duplicates:
object_list = {Object_01, Object_02, Object_03, Object_04, Object_05, ...}
我编写了一个自定义函数,如果对象相等,则返回True
,否则返回False
:
I've written a custom function, which returns True
, if the objects are equal and False
if not:
object_01.compare(object_02)
>>> True
比较方法效果很好,但是每次执行花费很多时间.我目前正在使用itertools.combinations(x, 2)
来遍历所有组合.我认为使用dict来存储已比较的对象并动态创建新集是一个好主意,例如:
Compare method works well, but takes a lot of time per execution. I'm currently using itertools.combinations(x, 2)
to iterate through all combinations. I've thought it's a good idea to use a dict for storing already compared objects and create new sets dynamically like:
dct = {'Compared': {}}
dct['Compared'] = set()
import itertools
for a, b in itertools.combinations(x, 2):
if b.name not in dct['Compared']:
if compare(a,b) == True:
#print (a,b)
key = a.name
value = b.name
if key not in dct:
dct[key] = set()
dct[key].add(value)
else:
dct[key].add(value)
dct[key].add(key)
dct['Compared'].add(b)
当前输出:
Compared: {'Object_02', 'Object_01', 'Object_03', 'Object_04', 'Object_05'}
Object_01: {'Object_02', 'Object_03', 'Object_01'}
Object_04: {'Object_05', 'Object_04'}
Object_05: {'Object_04'}
...
我想知道:是否有一种更快的方法来遍历所有组合以及如何中断/阻止已分配给对象的迭代重复列表?
I would like to know: Is there a faster way to iterate through all combinations and how to break/prevent the iteration of an object, which is already assigned to a list of duplicates?
所需的输出:
Compared: {'Object_02', 'Object_01', 'Object_03', 'Object_04', 'Object_05'}
Object_01: {'Object_02', 'Object_03', 'Object_01'}
Object_04: {'Object_05', 'Object_04'}
...
注意:比较方法是c-wrapper.要求是围绕它找到一种算法.
Note: Compare method is a c-wrapper. Requirement is to find an algorithm around it.
推荐答案
您不需要计算所有组合,只需检查给定的项目是否重复:
You don't need to calculate all combinations, you just need to check if a given item is a duplicate:
for i, a in enumerate(x):
if any(a.compare(b) for b in x[:i]):
# a is a duplicate of an already seen item, so do something
从技术上讲,这仍然是O(n ^ 2),但是您至少已经削减了所需的一半支票,并且应该快一些.
This is still technically O(n^2), but you've cut out at least half the checks required, and should be a bit faster.
简而言之,x[:i]
返回索引i
之前列表中的所有项目.如果项目x[i]
出现在该列表中,则说明它是重复项.如果没有,列表中的之后可能会重复,但是到达那里时您会担心.
In short, x[:i]
returns all items in the list before index i
. If the item x[i]
appears in that list, you know it's a duplicate. If not, there may be a duplicate after it in the list, but you worry about that when you get there.
在这里使用any
也很重要:如果找到任何真实项目,它将立即停止,而无需检查其余可迭代项.
Using any
is also important here: if it finds any true item, it will immediately stop, without checking the rest of the iterable.
您还可以通过从要检查的列表中删除已知的重复项来提高检查数量:
You could also improve the number of checks by removing known duplicates from the list you're checking against:
x_copy = x[:]
removed = 0
for i, a in enumerate(x):
if any(a.compare(b) for b in x_copy[:i-removed]):
del x_copy[i-removed]
removed += 1
# a is a duplicate of an already seen item, so do something
请注意,我们使用一个副本,以避免更改要迭代的序列,并且在使用索引时,我们需要考虑已删除的项目数.
Note that we use a copy, to avoid changing the sequence we're iterating over, and we need to take account for the number of items we've removed when using indexes.
接下来,我们只需要弄清楚如何构建字典即可.
Next, we just need to figure out how to build the dictionary.
这可能有点复杂.第一步是准确找出哪个元素是重复项.这可以通过实现any
只是for
循环的包装器来实现:
THis might be a little more complex. The first step is to figure out exactly which element is a duplicate. This can be done by realising any
is just a wrapper around a for
loop:
def any(iterable):
for item in iterable:
if item: return True
return False
然后我们可以进行一些小的更改,并传递一个函数:
We can then make a minor change, and pass in a function:
def first(iterable, fn):
for item in iterable:
if fn(item): return item
return None
现在,我们如下更改重复查找器:
Now, we change our duplicate finder as follows:
d = collections.defaultdict(list)
x_copy = x[:]
removed = 0
for i, a in enumerate(x):
b = first(x_copy[:i-removed], a.compare):
if b is not None:
# b is the first occurring duplicate of a
del x_copy[i-removed]
removed += 1
d[b.name].append(a)
else:
# we've not seen a yet, but might see it later
d[a.name].append(a)
这会将列表中的每个元素放入dict(-like)中.如果您只想要重复项,那么这只是获取所有长度大于1的条目的一种情况.
This will put every element in the list into a dict(-like). If you only want the duplicates, it's then just a case of getting all the entries with a length greater than 1.
这篇关于比较具有自定义功能的唯一对象列表的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!