Find duplicates for mixed type values in dictionaries
Question
I would like to recognize and group duplicate values in a dictionary. To do this I build a pseudo-hash (better read: signature) of my data set as follows:
from collections import defaultdict
from pickle import dumps

taxonomy = {}
binder = defaultdict(list)
for key, value in ds.items():
    # Use the pickled bytes of each value as its signature.
    signature = dumps(value)
    taxonomy[signature] = value
    binder[signature].append(key)
For a concrete use-case see this question.
Unfortunately I realized that even if the following statement is True:
>>> ds['key1'] == ds['key2']
True
the following is not always True:
>>> dumps(ds['key1']) == dumps(ds['key2'])
False
I noticed that the key order in the dumped output differs between the two dicts. If I copy/paste the output of ds['key1'] and ds['key2'] into new dictionaries, the comparison succeeds.
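The effect described above is easy to reproduce on a small scale. The sketch below assumes Python 3.7 or later, where a dict keeps its insertion order and pickle writes the items in that order; the two sample dicts are made up for illustration and are not the asker's data.

```python
from pickle import dumps

# Two hypothetical dicts holding the same items, inserted in a different order.
first = {"a": 1, "b": [1, 2]}
second = {"b": [1, 2], "a": 1}

# Dict equality ignores insertion order...
same_value = first == second

# ...but pickle serializes items in iteration order, so the bytes differ.
same_pickle = dumps(first) == dumps(second)

print(same_value, same_pickle)
```

This is why a pickled value is not a reliable signature for equality: it encodes how the dict was built, not only what it contains.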
As an overkill alternative I could traverse my dataset recursively and replace dict instances with OrderedDict:
import collections
import collections.abc
import copy

def faithfulrepr(od):
    od = copy.deepcopy(od)
    # collections.Mapping was removed in Python 3.10; use collections.abc.Mapping.
    if isinstance(od, collections.abc.Mapping):
        res = collections.OrderedDict()
        for k, v in sorted(od.items()):
            res[k] = faithfulrepr(v)
        return repr(res)
    if isinstance(od, list):
        for i, v in enumerate(od):
            od[i] = faithfulrepr(v)
        return repr(od)
    return repr(od)
>>> faithfulrepr(ds['key1']) == faithfulrepr(ds['key2'])
True
I am worried about this naive approach because I do not know whether it covers all possible situations.
What other (generic) alternative can I use?
Recommended answer
The first thing is to remove the call to deepcopy, which is your bottleneck here:
import collections
import collections.abc

def faithfulrepr(ds):
    # Build new containers instead of copying and mutating the input.
    if isinstance(ds, collections.abc.Mapping):
        res = collections.OrderedDict(
            (k, faithfulrepr(v)) for k, v in sorted(ds.items())
        )
    elif isinstance(ds, list):
        res = [faithfulrepr(v) for v in ds]
    else:
        res = ds
    return repr(res)
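As a rough illustration of how this copy-free version could replace the pickle signature from the question, the sketch below groups keys by the string that faithfulrepr returns. The sample ds is hypothetical, and the function is repeated only so the snippet runs on its own.

```python
from collections import OrderedDict, defaultdict
from collections.abc import Mapping


def faithfulrepr(ds):
    # Same logic as above: canonicalize mappings by sorting their items.
    if isinstance(ds, Mapping):
        res = OrderedDict((k, faithfulrepr(v)) for k, v in sorted(ds.items()))
    elif isinstance(ds, list):
        res = [faithfulrepr(v) for v in ds]
    else:
        res = ds
    return repr(res)


# Hypothetical data set: key1 and key2 hold equal values in a different key order.
ds = {
    "key1": {"a": 1, "b": [1, {"x": 2, "y": 3}]},
    "key2": {"b": [1, {"y": 3, "x": 2}], "a": 1},
    "key3": {"a": 2},
}

# Group the keys by signature, as the question does with dumps().
binder = defaultdict(list)
for key, value in ds.items():
    binder[faithfulrepr(value)].append(key)

groups = sorted(binder.values())
print(groups)
```

Because the signature is a string, it is hashable and the grouping stays a single pass over the data set.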
However, sorted and repr have their drawbacks:
- you cannot use mappings whose keys are of different types
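The mixed-key drawback can be seen with a tiny example. This is a sketch that assumes Python 3, where values of unrelated types such as int and str cannot be ordered; the mixed mapping is made up for illustration.

```python
# Hypothetical mapping whose keys mix int and str.
mixed = {1: "one", "two": 2}

try:
    sorted(mixed.items())
    sortable = True
except TypeError:
    # Python 3 refuses to order an int against a str.
    sortable = False

print(sortable)
```

Any signature that relies on sorting the items therefore fails as soon as a single mapping in the data set has keys that cannot be compared with each other.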
So the second thing is to get rid of faithfulrepr and compare objects with __eq__:
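binder, values = [], []
for key, value in ds.items():
    try:
        # list.index compares with ==, so no hashing or sorting is needed.
        index = values.index(value)
    except ValueError:
        # First time this value is seen: start a new group.
        values.append(value)
        binder.append([key])
    else:
        binder[index].append(key)
grouped = dict(zip(map(tuple, binder), values))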
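To try the equality-based grouping on its own, here is a self-contained sketch. The function name group_by_equality and the sample ds are hypothetical, and it returns a list of (keys, value) pairs instead of the dict built above, only to make the result easy to inspect.

```python
def group_by_equality(ds):
    """Group the keys of ds whose values compare equal with ==."""
    binder, values = [], []
    for key, value in ds.items():
        try:
            index = values.index(value)  # linear scan using __eq__
        except ValueError:
            values.append(value)
            binder.append([key])
        else:
            binder[index].append(key)
    return list(zip(map(tuple, binder), values))


# Hypothetical data: key1 and key2 are equal and use keys of mixed types.
ds = {
    "key1": {1: "a", "b": [1, 2]},
    "key2": {"b": [1, 2], 1: "a"},
    "key3": {"c": None},
}

grouped = group_by_equality(ds)
for keys, value in grouped:
    print(keys, value)
```

The trade-off is cost: each value is compared against the distinct values seen so far, so the number of comparisons grows quadratically in the worst case, whereas a hashable signature groups in one pass.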