Group similar dict entries as a tuple of keys


Question

I would like to group similar entries of a dataset.

ds = {1: 'foo',
      2: 'bar',
      3: 'foo',
      4: 'bar',
      5: 'foo'}

>>> tupelize_dict(ds)
{
   (1,3,5): 'foo',
   (2,4): 'bar'
}

I wrote this function, but I am sure there is a much simpler way, isn't there?

def tupelize_dict(data):
    from itertools import chain, combinations
    while True:
        rounds = []
        for x in combinations(data.keys(), 2):
            rounds.append((x, data[x[0]], data[x[1]]))

        end = True
        for k, a, b in rounds:
            if a == b:
                k_chain = [x if isinstance(x, (tuple, list)) else [x] for x in k]
                data[tuple(sorted(chain.from_iterable(k_chain)))] = a
                [data.pop(r) for r in k]
                end = False
                break
        if end:
            break
    return data  

Edit

I am interested in the general case, where the content of the dataset can be any type of object that supports ds[i] == ds[j]:

ds = {1: {'a': {'b':'c'}},
      2: 'bar',
      3: {'a': {'b':'c'}},
      4: 'bar',
      5: {'a': {'b':'c'}}}
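
The requirement stated above (values only need to support ==) can be met directly, without hashing. The following is a minimal sketch, not code from the original thread: the name group_by_equality is hypothetical, and it compares each value against one representative per group, so it costs roughly (number of entries) x (number of groups) comparisons.

```python
def group_by_equality(ds):
    """Group keys whose values compare equal, relying only on ==."""
    groups = []  # list of (representative_value, [keys]) pairs
    for key, value in ds.items():
        for representative, keys in groups:
            if representative == value:
                keys.append(key)
                break
        else:
            # No existing group matched: start a new one.
            groups.append((value, [key]))
    return {tuple(keys): representative for representative, keys in groups}


ds = {1: {'a': {'b': 'c'}},
      2: 'bar',
      3: {'a': {'b': 'c'}},
      4: 'bar',
      5: {'a': {'b': 'c'}}}

result = group_by_equality(ds)
print(result)
```

This avoids any serialization step, at the price of a scan over the existing groups for every entry; it is only attractive when the number of distinct values is small.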


Recommended answer

Following acushner's answer, it is possible to make this work if I can compute a hash of the content of each dataset element.

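import pickle
from collections import defaultdict

def tupelize_dict(ds):
    t = {}                   # signature -> value
    d = defaultdict(list)    # signature -> keys that share this value
    for k, v in ds.items():
        h = pickle.dumps(v)  # serialize the value itself, not the whole dict
        t[h] = v
        d[h].append(k)

    return {tuple(v): t[k] for k, v in d.items()}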

This solution is MUCH faster than my original proposal.

To test it, I made a set of big random nested dictionaries and ran cProfile on both implementations:

original: 204.9 seconds
new:        6.4 seconds
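
The benchmark itself is not shown in the thread. The following is only a sketch of how such a measurement could be set up: random_nested, group_by_pickle and profile_call are hypothetical names, the dataset is far smaller than the one behind the figures above, and the timings it prints are not comparable to them.

```python
import cProfile
import io
import pickle
import pstats
import random
from collections import defaultdict


def random_nested(rng, depth=2, width=2):
    """Build a small random nested dict (hypothetical test-data generator)."""
    if depth == 0:
        return rng.choice(["foo", "bar", "baz"])
    return {"k%d" % i: random_nested(rng, depth - 1, width) for i in range(width)}


def group_by_pickle(ds):
    """Stand-in for the implementation under test (pickle-based grouping)."""
    values = {}
    groups = defaultdict(list)
    for key, value in ds.items():
        signature = pickle.dumps(value)
        values[signature] = value
        groups[signature].append(key)
    return {tuple(keys): values[sig] for sig, keys in groups.items()}


def profile_call(func, *args):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(5)
    return result, stream.getvalue()


rng = random.Random(0)  # fixed seed so runs are repeatable
dataset = {i: random_nested(rng) for i in range(200)}
grouped, report = profile_call(group_by_pickle, dataset)
print(report)
```

Passing a different function to profile_call with the same dataset gives a like-for-like comparison between two implementations.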

Edit:

I realized that dumps does not work with some dictionaries, because the key order can vary internally for obscure reasons (see this question).
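
One way to reproduce this kind of mismatch is sketched below. It assumes Python 3.7 or later, where dicts preserve insertion order; the variable names are illustrative only. Two dicts that compare equal but were built in a different key order serialize to different bytes, so a pickle-based signature would put them in different groups.

```python
import pickle

a = {'x': 1, 'y': 2}
b = {'y': 2, 'x': 1}  # same content, different insertion order

same_value = (a == b)
same_bytes = (pickle.dumps(a) == pickle.dumps(b))
print(same_value, same_bytes)  # True False
```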

A workaround would be to order all the dicts:

import copy
import collections
import collections.abc

def faithfulrepr(od):
    # Work on a copy: the list branch below replaces items in place.
    od = copy.deepcopy(od)
    if isinstance(od, collections.abc.Mapping):
        # Rebuild the mapping with sorted keys so that repr() is order-independent.
        res = collections.OrderedDict()
        for k, v in sorted(od.items()):
            res[k] = faithfulrepr(v)
        return repr(res)
    if isinstance(od, list):
        for i, v in enumerate(od):
            od[i] = faithfulrepr(v)
        return repr(od)
    return repr(od)

def tupelize_dict(ds):
    taxonomy = {}                             # signature -> value
    binder = collections.defaultdict(list)    # signature -> keys
    for key, value in ds.items():
        signature = faithfulrepr(value)
        taxonomy[signature] = value
        binder[signature].append(key)
    def tu(keys):
        # A single key is returned as-is rather than as a 1-tuple.
        return tuple(sorted(keys)) if len(keys) > 1 else keys[0]
    return {tu(keys): taxonomy[s] for s, keys in binder.items()}
