比较字典项目时修改字典的有效方法 [英] Efficient way to modify a dictionary while comparing its items

查看:82
本文介绍了比较字典项目时修改字典的有效方法的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一本以字符串为键并设置为值的字典.这些集合包含整数,这对于字典的不同项目可能是通用的.确实,我的目标是将每个项目与其余项目进行比较,并在设置值中找到那些共同的数字.一旦找到了一对至少在其值集中共享一个数字的项目(比如说(key1,val1)(key2,val2)),我想向字典添加一个额外的项,该项的键是键 key1 key2 的串联(我不在乎订单),并且其值是与 val1 val2 中的公共数字.此外,应该从该项目中删除 val1 val2 中的公用数字,并且在最终的情况下,将这些集合保留为空,则应删除该项目.

I have a dictionary with strings as keys and sets as values. These sets contain integers, which may be common for different items of the dictionary. Indeed, my goal is to compare each item against the rest and find those common numbers in the set values. Once located a pair of items that at least share a number in their value sets (lets say, (key1, val1) and (key2, val2)), I want to add an extra item to the dictionary whose key is the concatenation of both keys key1 and key2 (I DO NOT CARE ABOUT THE ORDER), and whose value is a set with the common numbers in val1 and val2. Additionally, the common numbers in val1 and val2 should be removed from that item, and in the eventual case those sets are left empty, that item should be removed.

长描述,需要示例:

我有一本这样的字典:

db = {"a": {1}, "b":{1}, "c":{2,3,4}, "d":{2,3}}

我正在寻找一个返回如下内容的过程:

I am looking for a procedure that returns something like this:

{"ab": {1}, "c":{4}, "cd": {2,3}}

我实际上有一个解决方案,但是我怀疑效率很低,尤其是因为我开始为键和字典的值创建两个列表,以便能够进行嵌套循环.也就是说,

I actually have a solution, but I suspect is quite inefficient, especially because I start creating two lists for the keys and the values of the dictionary, in order to be able to do a nested loop. That is,

db = {"a":{1}, "b":{1}, "c":{2,3,4}, "d":{2,3}}

keys = list(db.keys())
values = list(db.values())
blackList = set() #I plan to add here the indexes to be removed
for i, key in enumerate(keys):
    if i in blackList: continue
    keysToAdd = []
    valuesToAdd = []
    for j in range(i+1,len(keys)):
        if j in blackList: continue
        
        intersect = values[i].intersection(values[j])

        if len(intersect)==0: continue
        
        keysToAdd.append(key+keys[j]) #I don't care about the order
        valuesToAdd.append(intersect)

        values[i] -= intersect
        values[j] -= intersect

        if len(values[j])==0:
            blackList.add(j)

        if len(values[i])==0:
            blackList.add(i)
            break
    #out of j loop
    keys.extend(keysToAdd)
    values.extend(valuesToAdd)

#finally delete elements in the blackList
for i in sorted(blackList, reverse=True):
    del keys[i]
    del values[i]

print(dict(zip(keys, values))) #{"ab": {1}, "c":{4}, "cd":{2,3}} #I don't care about the order

我想知道我的方法是否可以改进,以便有一个更有效的解决方案(因为我计划有庞大而复杂的输入指令).它实际上是使用 itertools.combinations 的一种方式吗?-也许是一种完全不同的方法来解决此问题.预先感谢

I would like to know if my approach can be improved, in order to have a more efficient solution (because I plan to have big and complex input dicts). Is it actually a way of using itertools.combinations? - Or perhaps a completely different approach to solve this problem. Thanks in advance

推荐答案

我认为您的方法并非效率太低,但是至少有一种更快的方法可以实现这一目标.对于 defaultdict ,我认为这是一项不错的工作..>来自集合的

I think your approach is not overly inefficient, but there is at least one slightly faster way to achieve this. I think this is a good job for defaultdict.

from collections import defaultdict

def deduplicate(db: dict) -> dict:
    # Collect all keys which each number appears in by inverting the mapping:
    inverse = defaultdict(set)  # number -> keys
    for key, value in db.items():
        for number in value:
            inverse[number].add(key)

    # Invert the mapping again - this time joining the keys:
    result = defaultdict(set)  # joined key -> numbers
    for number, keys in inverse.items():
        new_key = "".join(sorted(keys))
        result[new_key].add(number)
    return dict(result)

使用示例进行比较时,速度大约快6倍.我毫不怀疑有更快的方法来实现它!

When comparing using your example - this is roughly 6 times faster. I don't doubt there are faster ways to achieve it!

使用 itertools.combinations 将是一种合理的方法.由于这可能会更大-您需要迭代各种大小的组合,最终导致整体迭代更多.

Using itertools.combinations would be a reasonable approach if for example you knew that each number could show up in at most 2 keys. As this can presumably be larger - you would need to iterate combinations of varying sizes, ending up with more iterations overall.

注释1 :在加入新密钥之前,我已经对密钥进行了排序.您提到您不介意顺序-但是在此解决方案中,键的一致性很重要,以避免同时存在"abc" "bac" 用不同的数字.一致的顺序还可以使您更轻松地进行测试.

Note 1: I have sorted the keys before joining into a new key. You mention that you don't mind about the order - but in this solution its important that keys are consistent, to avoid both "abc" and "bac" existing with different numbers. The consistent order will also make it easier for you to test.

注释2 :从黑名单中删除值的步骤会对输入 db 产生副作用-因此,使用相同的输入 db 会产生不同的结果.您可以通过执行 deepcopy :

Note 2: the step where you remove values from the blacklist has a side effect on the input db - and so running your code twice with the same input db would produce different results. You could avoid that in your original implementation by performing a deepcopy:

values = deepcopy(list(db.values())

这篇关于比较字典项目时修改字典的有效方法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆